Reading files from multiple paths in Azure Blob Storage with Spark takes a very long time
%livy
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
val spark = (SparkSession
.builder
.appName("blob")
.getOrCreate())
spark.conf.set("fs.azure.xxx.xxx.blob.core.xx.xxx","xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
val df = spark.read.parquet("wasb://xxxxxxxxxx@xxxxxxxx.blob.core.xxxx.xxx/root/date=20180801/id_*_*/user=*/*.parquet")
// Example of one matched path (the read is "multipath"): "wasb://xxxxxxxxxx@xxxxxxxx.blob.core.xxxx.xxx/root/date=20180801/id_01_98834_343434_3535_353835/user=jhon1/part0001_00201_23234_46734_3435234.parquet"
The read takes 1 minute 27 seconds!
About 82 paths are matched in total, and the total data volume is less than 1 MB! When I read a single path, it takes 1 second. The Livy interpreter has 2 executors. I also tried with pyspark, and it takes 1 min 27 sec with 4 executors. For such a small dataset it should finish in a few seconds.
According to the log files, scanning/generating the paths and listing them takes 2 minutes 49 seconds; after the 70 paths are listed, the job itself takes 11 seconds. Is this a Spark issue or an Azure Blob Storage issue, and how can I investigate it further?
Why is the path scanning so slow?
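One way to narrow this down is to time the listing step and the read step separately. The following is a minimal sketch, not a definitive diagnostic: `Timing` and `timed` are hypothetical names, and the commented usage lines assume a Hadoop `FileSystem` called `fs`, a glob `Path` called `srcPath`, and a `paths` array already exist in the session.

```scala
object Timing {
  /** Runs `block`, prints the elapsed wall-clock time, and returns
    * the block's result together with the elapsed milliseconds. */
  def timed[A](label: String)(block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    println(s"$label took $elapsedMs ms")
    (result, elapsedMs)
  }
}

// Hypothetical usage in the notebook (fs, srcPath and paths are assumed):
//   val (files, listMs) = Timing.timed("glob listing")(fs.globStatus(srcPath))
//   val (df, readMs)    = Timing.timed("parquet read")(spark.read.parquet(paths: _*))
```

Note that `spark.read.parquet` already lists files and reads Parquet footers when it is called, so its timing includes more than the later job execution.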
Update after comments: if it is scanning the whole container, it might be better if we could set wasb://xxxxxxxxxx@xxxxxxxx.blob.core.xxxx.xxx/root/date=20180801/
as temp_home_path
and then scan the rest of the path like this: val df = spark.read.parquet("/id_*_*/user=*/*.parquet")
How can I do that?
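A minimal sketch of the closest equivalent that needs no special configuration: keep the base directory in a value and join it with the relative glob before passing it to `spark.read.parquet`. `GlobPaths`, `resolve` and `tempHomePath` are hypothetical names introduced for illustration. This only shortens the call site; it is an assumption, not a verified fact, that it has no effect on how much of the container gets listed.

```scala
object GlobPaths {
  /** Joins a base directory and a relative glob pattern with exactly one
    * slash between them, whatever slashes the two inputs carry. */
  def resolve(base: String, relativeGlob: String): String =
    base.stripSuffix("/") + "/" + relativeGlob.stripPrefix("/")
}

// Hypothetical usage in the notebook (spark is the existing SparkSession):
//   val tempHomePath = "wasb://xxxxxxxxxx@xxxxxxxx.blob.core.xxxx.xxx/root/date=20180801/"
//   val df = spark.read.parquet(GlobPaths.resolve(tempHomePath, "/id_*_*/user=*/*.parquet"))
```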
Update 2
I used the Hadoop fs API to see what is happening.
Finding the paths with globStatus took 33 sec; there are no log files for this step.
import org.apache.hadoop.fs.{FileSystem, Path}
val src_path = new Path("wasb://xxxxxxxxxx@xxxxxxxx.blob.core.xxxx.xxx/root/date=20180801/id_*_*/user=*/*.parquet")
// java.net.URI has no constructor that accepts a Path, so use Path.toUri
val fs = FileSystem.get(src_path.toUri, new org.apache.hadoop.conf.Configuration)
// Expand the glob into the concrete file statuses
val pathsArray = fs.globStatus(src_path)
val paths = pathsArray.map(p => p.getPath.toString) // 82 paths
Reading the Parquet files took 1 min 11 sec.
val df = spark.read.parquet(paths: _*)
Log files for the read
18/08/20 16:15:37 Execute read parquet (added by me)
18/08/20 16:16:26 After 49 sec (added by me)
18/08/20 16:16:26 INFO CoarseGrainedExecutorBackend: Got assigned task 333
18/08/20 16:16:26 INFO Executor: Running task 1.0 in stage 8.0 (TID 333)
18/08/20 16:16:26 INFO TorrentBroadcast: Started reading broadcast variable 8
18/08/20 16:16:26 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 40.5 KB, free 408.9 MB)
18/08/20 16:16:26 INFO TorrentBroadcast: Reading broadcast variable 8 took 18 ms
18/08/20 16:16:26 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 143.5 KB, free 408.7 MB)
18/08/20 16:16:27 INFO Executor: Finished task 1.0 in stage 8.0 (TID 333). 2052 bytes result sent to driver
18/08/20 16:16:27 INFO CoarseGrainedExecutorBackend: Got assigned task 335
18/08/20 16:16:27 INFO Executor: Running task 3.0 in stage 8.0 (TID 335)
18/08/20 16:16:27 INFO Executor: Finished task 3.0 in stage 8.0 (TID 335). 2052 bytes result sent to driver
18/08/20 16:16:27 INFO CoarseGrainedExecutorBackend: Got assigned task 337
18/08/20 16:16:27 INFO Executor: Running task 5.0 in stage 8.0 (TID 337)
18/08/20 16:16:28 INFO Executor: Finished task 5.0 in stage 8.0 (TID 337). 2052 bytes result sent to driver
18/08/20 16:16:28 INFO CoarseGrainedExecutorBackend: Got assigned task 339
18/08/20 16:16:28 INFO Executor: Running task 7.0 in stage 8.0 (TID 339)
18/08/20 16:16:28 INFO Executor: Finished task 7.0 in stage 8.0 (TID 339). 2052 bytes result sent to driver
18/08/20 16:16:28 INFO CoarseGrainedExecutorBackend: Got assigned task 341
18/08/20 16:16:28 INFO Executor: Running task 9.0 in stage 8.0 (TID 341)
18/08/20 16:16:29 INFO Executor: Finished task 9.0 in stage 8.0 (TID 341). 2052 bytes result sent to driver
18/08/20 16:16:29 INFO CoarseGrainedExecutorBackend: Got assigned task 343
18/08/20 16:16:29 INFO Executor: Running task 11.0 in stage 8.0 (TID 343)
18/08/20 16:16:29 INFO Executor: Finished task 11.0 in stage 8.0 (TID 343). 2052 bytes result sent to driver
18/08/20 16:16:29 INFO CoarseGrainedExecutorBackend: Got assigned task 345
18/08/20 16:16:29 INFO Executor: Running task 13.0 in stage 8.0 (TID 345)
18/08/20 16:16:30 INFO Executor: Finished task 13.0 in stage 8.0 (TID 345). 2052 bytes result sent to driver
18/08/20 16:16:30 INFO CoarseGrainedExecutorBackend: Got assigned task 347
18/08/20 16:16:30 INFO Executor: Running task 15.0 in stage 8.0 (TID 347)
18/08/20 16:16:30 INFO Executor: Finished task 15.0 in stage 8.0 (TID 347). 2052 bytes result sent to driver
18/08/20 16:16:30 INFO CoarseGrainedExecutorBackend: Got assigned task 349
18/08/20 16:16:30 INFO Executor: Running task 17.0 in stage 8.0 (TID 349)
18/08/20 16:16:31 INFO Executor: Finished task 17.0 in stage 8.0 (TID 349). 2052 bytes result sent to driver
18/08/20 16:16:31 INFO CoarseGrainedExecutorBackend: Got assigned task 351
18/08/20 16:16:31 INFO Executor: Running task 19.0 in stage 8.0 (TID 351)
18/08/20 16:16:31 INFO Executor: Finished task 19.0 in stage 8.0 (TID 351). 2052 bytes result sent to driver
18/08/20 16:16:31 INFO CoarseGrainedExecutorBackend: Got assigned task 353
18/08/20 16:16:31 INFO Executor: Running task 21.0 in stage 8.0 (TID 353)
18/08/20 16:16:32 INFO Executor: Finished task 21.0 in stage 8.0 (TID 353). 2093 bytes result sent to driver
18/08/20 16:16:32 INFO CoarseGrainedExecutorBackend: Got assigned task 355
18/08/20 16:16:32 INFO Executor: Running task 23.0 in stage 8.0 (TID 355)
18/08/20 16:16:32 INFO Executor: Finished task 23.0 in stage 8.0 (TID 355). 2048 bytes result sent to driver
18/08/20 16:16:32 INFO CoarseGrainedExecutorBackend: Got assigned task 357
18/08/20 16:16:32 INFO Executor: Running task 25.0 in stage 8.0 (TID 357)
18/08/20 16:16:32 INFO Executor: Finished task 25.0 in stage 8.0 (TID 357). 2044 bytes result sent to driver
18/08/20 16:16:32 INFO CoarseGrainedExecutorBackend: Got assigned task 359
18/08/20 16:16:32 INFO Executor: Running task 27.0 in stage 8.0 (TID 359)
18/08/20 16:16:33 INFO Executor: Finished task 27.0 in stage 8.0 (TID 359). 2044 bytes result sent to driver
18/08/20 16:16:33 INFO CoarseGrainedExecutorBackend: Got assigned task 361
18/08/20 16:16:33 INFO Executor: Running task 29.0 in stage 8.0 (TID 361)
18/08/20 16:16:33 INFO Executor: Finished task 29.0 in stage 8.0 (TID 361). 2052 bytes result sent to driver
18/08/20 16:16:33 INFO CoarseGrainedExecutorBackend: Got assigned task 363
18/08/20 16:16:33 INFO Executor: Running task 31.0 in stage 8.0 (TID 363)
18/08/20 16:16:34 INFO Executor: Finished task 31.0 in stage 8.0 (TID 363). 2052 bytes result sent to driver
18/08/20 16:16:34 INFO CoarseGrainedExecutorBackend: Got assigned task 365
18/08/20 16:16:34 INFO Executor: Running task 33.0 in stage 8.0 (TID 365)
18/08/20 16:16:34 INFO Executor: Finished task 33.0 in stage 8.0 (TID 365). 2044 bytes result sent to driver
18/08/20 16:16:34 INFO CoarseGrainedExecutorBackend: Got assigned task 367
18/08/20 16:16:34 INFO Executor: Running task 35.0 in stage 8.0 (TID 367)
18/08/20 16:16:35 INFO Executor: Finished task 35.0 in stage 8.0 (TID 367). 2095 bytes result sent to driver
18/08/20 16:16:35 INFO CoarseGrainedExecutorBackend: Got assigned task 369
18/08/20 16:16:35 INFO Executor: Running task 37.0 in stage 8.0 (TID 369)
18/08/20 16:16:35 INFO Executor: Finished task 37.0 in stage 8.0 (TID 369). 2052 bytes result sent to driver
18/08/20 16:16:35 INFO CoarseGrainedExecutorBackend: Got assigned task 371
18/08/20 16:16:35 INFO Executor: Running task 39.0 in stage 8.0 (TID 371)
18/08/20 16:16:36 INFO Executor: Finished task 39.0 in stage 8.0 (TID 371). 2052 bytes result sent to driver
18/08/20 16:16:36 INFO CoarseGrainedExecutorBackend: Got assigned task 373
18/08/20 16:16:36 INFO Executor: Running task 41.0 in stage 8.0 (TID 373)
18/08/20 16:16:36 INFO Executor: Finished task 41.0 in stage 8.0 (TID 373). 2052 bytes result sent to driver
18/08/20 16:16:36 INFO CoarseGrainedExecutorBackend: Got assigned task 375
18/08/20 16:16:36 INFO Executor: Running task 43.0 in stage 8.0 (TID 375)
18/08/20 16:16:37 INFO Executor: Finished task 43.0 in stage 8.0 (TID 375). 2044 bytes result sent to driver
18/08/20 16:16:37 INFO CoarseGrainedExecutorBackend: Got assigned task 377
18/08/20 16:16:37 INFO Executor: Running task 45.0 in stage 8.0 (TID 377)
18/08/20 16:16:37 INFO Executor: Finished task 45.0 in stage 8.0 (TID 377). 2052 bytes result sent to driver
18/08/20 16:16:37 INFO CoarseGrainedExecutorBackend: Got assigned task 378
18/08/20 16:16:37 INFO Executor: Running task 46.0 in stage 8.0 (TID 378)
18/08/20 16:16:37 INFO Executor: Finished task 46.0 in stage 8.0 (TID 378). 2095 bytes result sent to driver
18/08/20 16:16:37 INFO CoarseGrainedExecutorBackend: Got assigned task 380
18/08/20 16:16:37 INFO Executor: Running task 48.0 in stage 8.0 (TID 380)
18/08/20 16:16:38 INFO Executor: Finished task 48.0 in stage 8.0 (TID 380). 2052 bytes result sent to driver
18/08/20 16:16:38 INFO CoarseGrainedExecutorBackend: Got assigned task 382
18/08/20 16:16:38 INFO Executor: Running task 50.0 in stage 8.0 (TID 382)
18/08/20 16:16:38 INFO Executor: Finished task 50.0 in stage 8.0 (TID 382). 2044 bytes result sent to driver
18/08/20 16:16:38 INFO CoarseGrainedExecutorBackend: Got assigned task 384
18/08/20 16:16:38 INFO Executor: Running task 52.0 in stage 8.0 (TID 384)
18/08/20 16:16:39 INFO Executor: Finished task 52.0 in stage 8.0 (TID 384). 2052 bytes result sent to driver
18/08/20 16:16:39 INFO CoarseGrainedExecutorBackend: Got assigned task 386
18/08/20 16:16:39 INFO Executor: Running task 54.0 in stage 8.0 (TID 386)
18/08/20 16:16:39 INFO Executor: Finished task 54.0 in stage 8.0 (TID 386). 2052 bytes result sent to driver
18/08/20 16:16:39 INFO CoarseGrainedExecutorBackend: Got assigned task 388
18/08/20 16:16:39 INFO Executor: Running task 56.0 in stage 8.0 (TID 388)
18/08/20 16:16:40 INFO Executor: Finished task 56.0 in stage 8.0 (TID 388). 2044 bytes result sent to driver
18/08/20 16:16:40 INFO CoarseGrainedExecutorBackend: Got assigned task 390
18/08/20 16:16:40 INFO Executor: Running task 58.0 in stage 8.0 (TID 390)
18/08/20 16:16:40 INFO Executor: Finished task 58.0 in stage 8.0 (TID 390). 2095 bytes result sent to driver
18/08/20 16:16:40 INFO CoarseGrainedExecutorBackend: Got assigned task 392
18/08/20 16:16:40 INFO Executor: Running task 60.0 in stage 8.0 (TID 392)
18/08/20 16:16:41 INFO Executor: Finished task 60.0 in stage 8.0 (TID 392). 2052 bytes result sent to driver
18/08/20 16:16:41 INFO CoarseGrainedExecutorBackend: Got assigned task 394
18/08/20 16:16:41 INFO Executor: Running task 62.0 in stage 8.0 (TID 394)
18/08/20 16:16:41 INFO Executor: Finished task 62.0 in stage 8.0 (TID 394). 2052 bytes result sent to driver
18/08/20 16:16:41 INFO CoarseGrainedExecutorBackend: Got assigned task 396
18/08/20 16:16:41 INFO Executor: Running task 64.0 in stage 8.0 (TID 396)
18/08/20 16:16:42 INFO Executor: Finished task 64.0 in stage 8.0 (TID 396). 2095 bytes result sent to driver
18/08/20 16:16:42 INFO CoarseGrainedExecutorBackend: Got assigned task 398
18/08/20 16:16:42 INFO Executor: Running task 66.0 in stage 8.0 (TID 398)
18/08/20 16:16:42 INFO Executor: Finished task 66.0 in stage 8.0 (TID 398). 2052 bytes result sent to driver
18/08/20 16:16:42 INFO CoarseGrainedExecutorBackend: Got assigned task 400
18/08/20 16:16:42 INFO Executor: Running task 68.0 in stage 8.0 (TID 400)
18/08/20 16:16:42 INFO Executor: Finished task 68.0 in stage 8.0 (TID 400). 2052 bytes result sent to driver
18/08/20 16:16:42 INFO CoarseGrainedExecutorBackend: Got assigned task 402
18/08/20 16:16:42 INFO Executor: Running task 70.0 in stage 8.0 (TID 402)
18/08/20 16:16:43 INFO Executor: Finished task 70.0 in stage 8.0 (TID 402). 2052 bytes result sent to driver
18/08/20 16:16:43 INFO CoarseGrainedExecutorBackend: Got assigned task 404
18/08/20 16:16:43 INFO Executor: Running task 72.0 in stage 8.0 (TID 404)
18/08/20 16:16:43 INFO Executor: Finished task 72.0 in stage 8.0 (TID 404). 2044 bytes result sent to driver
18/08/20 16:16:43 INFO CoarseGrainedExecutorBackend: Got assigned task 406
18/08/20 16:16:43 INFO Executor: Running task 74.0 in stage 8.0 (TID 406)
18/08/20 16:16:44 INFO Executor: Finished task 74.0 in stage 8.0 (TID 406). 2044 bytes result sent to driver
18/08/20 16:16:44 INFO CoarseGrainedExecutorBackend: Got assigned task 408
18/08/20 16:16:44 INFO Executor: Running task 76.0 in stage 8.0 (TID 408)
18/08/20 16:16:44 INFO Executor: Finished task 76.0 in stage 8.0 (TID 408). 2052 bytes result sent to driver
18/08/20 16:16:44 INFO CoarseGrainedExecutorBackend: Got assigned task 410
18/08/20 16:16:44 INFO Executor: Running task 78.0 in stage 8.0 (TID 410)
18/08/20 16:16:45 INFO Executor: Finished task 78.0 in stage 8.0 (TID 410). 2052 bytes result sent to driver
18/08/20 16:16:45 INFO CoarseGrainedExecutorBackend: Got assigned task 412
18/08/20 16:16:45 INFO Executor: Running task 80.0 in stage 8.0 (TID 412)
18/08/20 16:16:45 INFO Executor: Finished task 80.0 in stage 8.0 (TID 412). 2052 bytes result sent to driver
18/08/20 16:16:46 INFO CoarseGrainedExecutorBackend: Got assigned task 414
18/08/20 16:16:46 INFO Executor: Running task 0.0 in stage 9.0 (TID 414)
18/08/20 16:16:46 INFO TorrentBroadcast: Started reading broadcast variable 9
18/08/20 16:16:46 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 40.6 KB, free 408.7 MB)
18/08/20 16:16:46 INFO TorrentBroadcast: Reading broadcast variable 9 took 9 ms
18/08/20 16:16:46 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 143.5 KB, free 408.7 MB)
18/08/20 16:16:46 INFO Executor: Finished task 0.0 in stage 9.0 (TID 414). 4215 bytes result sent to driver
The total time with Hadoop fs is 1 min 42 sec.
If I use wildcards instead, it takes 1 min 27 sec.
val df = spark.read.parquet("wasb://xxxxxxxxxx@xxxxxxxx.blob.core.xxxx.xxx/root/date=20180801/id_*_*/user=*/*.parquet")
The Spark work takes 20 seconds, while Hadoop + Azure takes 50 seconds. Why does this happen, and how can I improve the performance?
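The split between the two phases can be read straight off the log timestamps. Below is a small sketch of that arithmetic, assuming every line starts with the `yy/MM/dd HH:mm:ss` prefix seen in the log above; `LogGaps`, `parse` and `secondsBetween` are hypothetical helper names.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

object LogGaps {
  // Timestamp prefix as it appears in the log lines, e.g. "18/08/20 16:15:37"
  private val fmt = DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss")

  /** Parses the 17-character timestamp at the start of a log line. */
  def parse(line: String): LocalDateTime =
    LocalDateTime.parse(line.take(17), fmt)

  /** Whole seconds elapsed between the timestamps of two log lines. */
  def secondsBetween(first: String, second: String): Long =
    Duration.between(parse(first), parse(second)).getSeconds
}
```

Applied to the marker line at 16:15:37 and the first task assignment at 16:16:26, `secondsBetween` gives 49 seconds with no executor activity; from that first assignment to the last finished task at 16:16:46 it gives 20 seconds.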