Reputation: 2977
I have Parquet files laid out in Hive-style partitions in an S3n bucket. No metadata files are created; the Parquet footers are stored inside the data files themselves. I tried a sample Spark job in local mode (v1.6.0) that reads a single file of size 5.2 MB:
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}

val filePath = "s3n://bucket/trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet"
val path: Path = new Path(filePath)

val conf = new SparkConf().setMaster("local[2]").set("spark.app.name", "parquet-reader-s3n").set("spark.eventLog.enabled", "true")
val sc = new SparkContext(conf)
val sqlc = new org.apache.spark.sql.SQLContext(sc)

val df = sqlc.read.parquet(filePath).select("referenceCode")
Thread.sleep(1000 * 10) // intentionally added
println(df.schema)
val output = df.collect
The log generated is:
..
[22:21:56.505][main][INFO][BlockManagerMaster:58] Registered BlockManager
[22:21:56.909][main][INFO][EventLoggingListener:58] Logging events to file:/tmp/spark-events/local-1463676716372
[22:21:57.307][main][INFO][ParquetRelation:58] Listing s3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet on driver
[22:21:59.927][main][INFO][SparkContext:58] Starting job: parquet at InspectInputSplits.scala:30
[22:21:59.942][dag-scheduler-event-loop][INFO][DAGScheduler:58] Got job 0 (parquet at InspectInputSplits.scala:30) with 2 output partitions
[22:21:59.942][dag-scheduler-event-loop][INFO][DAGScheduler:58] Final stage: ResultStage 0 (parquet at InspectInputSplits.scala:30)
[22:21:59.943][dag-scheduler-event-loop][INFO][DAGScheduler:58] Parents of final stage: List()
[22:21:59.944][dag-scheduler-event-loop][INFO][DAGScheduler:58] Missing parents: List()
[22:21:59.954][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting ResultStage 0 (MapPartitionsRDD[1] at parquet at InspectInputSplits.scala:30), which has no missing parents
[22:22:00.218][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_0 stored as values in memory (estimated size 64.5 KB, free 64.5 KB)
[22:22:00.226][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.7 KB, free 86.2 KB)
[22:22:00.229][dispatcher-event-loop-0][INFO][BlockManagerInfo:58] Added broadcast_0_piece0 in memory on localhost:54419 (size: 21.7 KB, free: 1088.2 MB)
[22:22:00.231][dag-scheduler-event-loop][INFO][SparkContext:58] Created broadcast 0 from broadcast at DAGScheduler.scala:1006
[22:22:00.234][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at parquet at InspectInputSplits.scala:30)
[22:22:00.235][dag-scheduler-event-loop][INFO][TaskSchedulerImpl:58] Adding task set 0.0 with 2 tasks
[22:22:00.278][dispatcher-event-loop-1][INFO][TaskSetManager:58] Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2076 bytes)
[22:22:00.281][dispatcher-event-loop-1][INFO][TaskSetManager:58] Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2395 bytes)
[22:22:00.290][Executor task launch worker-0][INFO][Executor:58] Running task 0.0 in stage 0.0 (TID 0)
[22:22:00.291][Executor task launch worker-1][INFO][Executor:58] Running task 1.0 in stage 0.0 (TID 1)
[22:22:00.425][Executor task launch worker-1][INFO][ParquetFileReader:151] Initiating action with parallelism: 5
[22:22:00.447][Executor task launch worker-0][INFO][ParquetFileReader:151] Initiating action with parallelism: 5
[22:22:00.463][Executor task launch worker-0][INFO][Executor:58] Finished task 0.0 in stage 0.0 (TID 0). 936 bytes result sent to driver
[22:22:00.471][task-result-getter-0][INFO][TaskSetManager:58] Finished task 0.0 in stage 0.0 (TID 0) in 213 ms on localhost (1/2)
[22:22:00.586][pool-20-thread-1][INFO][NativeS3FileSystem:619] Opening 's3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet' for reading
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[22:22:25.890][Executor task launch worker-1][INFO][Executor:58] Finished task 1.0 in stage 0.0 (TID 1). 4067 bytes result sent to driver
[22:22:25.898][task-result-getter-1][INFO][TaskSetManager:58] Finished task 1.0 in stage 0.0 (TID 1) in 25617 ms on localhost (2/2)
[22:22:25.898][dag-scheduler-event-loop][INFO][DAGScheduler:58] ResultStage 0 (parquet at InspectInputSplits.scala:30) finished in 25.656 s
[22:22:25.899][task-result-getter-1][INFO][TaskSchedulerImpl:58] Removed TaskSet 0.0, whose tasks have all completed, from pool
[22:22:25.905][main][INFO][DAGScheduler:58] Job 0 finished: parquet at InspectInputSplits.scala:30, took 25.977801 s
StructType(StructField(referenceCode,StringType,true))
[22:22:36.271][main][INFO][DataSourceStrategy:58] Selected 1 partitions out of 1, pruned 0.0% partitions.
[22:22:36.325][main][INFO][MemoryStore:58] Block broadcast_1 stored as values in memory (estimated size 89.3 KB, free 175.5 KB)
[22:22:36.389][main][INFO][MemoryStore:58] Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.2 KB, free 195.7 KB)
[22:22:36.389][dispatcher-event-loop-0][INFO][BlockManagerInfo:58] Added broadcast_1_piece0 in memory on localhost:54419 (size: 20.2 KB, free: 1088.2 MB)
[22:22:36.391][main][INFO][SparkContext:58] Created broadcast 1 from collect at InspectInputSplits.scala:34
[22:22:36.520][main][INFO][deprecation:1174] mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
[22:22:36.522][main][INFO][ParquetRelation:58] Reading Parquet file(s) from s3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet
[22:22:36.554][main][INFO][SparkContext:58] Starting job: collect at InspectInputSplits.scala:34
[22:22:36.556][dag-scheduler-event-loop][INFO][DAGScheduler:58] Got job 1 (collect at InspectInputSplits.scala:34) with 1 output partitions
[22:22:36.556][dag-scheduler-event-loop][INFO][DAGScheduler:58] Final stage: ResultStage 1 (collect at InspectInputSplits.scala:34)
[22:22:36.556][dag-scheduler-event-loop][INFO][DAGScheduler:58] Parents of final stage: List()
[22:22:36.557][dag-scheduler-event-loop][INFO][DAGScheduler:58] Missing parents: List()
[22:22:36.557][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting ResultStage 1 (MapPartitionsRDD[4] at collect at InspectInputSplits.scala:34), which has no missing parents
[22:22:36.571][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_2 stored as values in memory (estimated size 7.6 KB, free 203.3 KB)
[22:22:36.575][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.0 KB, free 207.3 KB)
[22:22:36.576][dispatcher-event-loop-1][INFO][BlockManagerInfo:58] Added broadcast_2_piece0 in memory on localhost:54419 (size: 4.0 KB, free: 1088.2 MB)
[22:22:36.577][dag-scheduler-event-loop][INFO][SparkContext:58] Created broadcast 2 from broadcast at DAGScheduler.scala:1006
[22:22:36.577][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[4] at collect at InspectInputSplits.scala:34)
[22:22:36.577][dag-scheduler-event-loop][INFO][TaskSchedulerImpl:58] Adding task set 1.0 with 1 tasks
[22:22:36.585][dispatcher-event-loop-3][INFO][TaskSetManager:58] Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2481 bytes)
[22:22:36.586][Executor task launch worker-1][INFO][Executor:58] Running task 0.0 in stage 1.0 (TID 2)
[22:22:36.605][Executor task launch worker-1][INFO][ParquetRelation$$anonfun$buildInternalScan$1$$anon$1:58] Input split: ParquetInputSplit{part: s3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet start: 0 end: 5364897 length: 5364897 hosts: []}
[22:22:38.253][Executor task launch worker-1][INFO][NativeS3FileSystem:619] Opening 's3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet' for reading
[22:23:04.249][Executor task launch worker-1][INFO][NativeS3FileSystem:619] Opening 's3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet' for reading
[22:23:28.337][Executor task launch worker-1][INFO][CodecPool:181] Got brand-new decompressor [.gz]
[22:23:28.400][dispatcher-event-loop-1][INFO][BlockManagerInfo:58] Removed broadcast_0_piece0 on localhost:54419 in memory (size: 21.7 KB, free: 1088.2 MB)
[22:23:28.408][Spark Context Cleaner][INFO][ContextCleaner:58] Cleaned accumulator 1
[22:23:49.993][Executor task launch worker-1][INFO][Executor:58] Finished task 0.0 in stage 1.0 (TID 2). 9376344 bytes result sent to driver
[22:23:50.191][task-result-getter-2][INFO][TaskSetManager:58] Finished task 0.0 in stage 1.0 (TID 2) in 73612 ms on localhost (1/1)
[22:23:50.191][task-result-getter-2][INFO][TaskSchedulerImpl:58] Removed TaskSet 1.0, whose tasks have all completed, from pool
[22:23:50.191][dag-scheduler-event-loop][INFO][DAGScheduler:58] ResultStage 1 (collect at InspectInputSplits.scala:34) finished in 73.612 s
[22:23:50.195][main][INFO][DAGScheduler:58] Job 1 finished: collect at InspectInputSplits.scala:34, took 73.640193 s
Questions:

1. In the logs, I can see that the Parquet file is read a total of 3 times: once by the [pool-21-thread-1] thread (on the driver) and two more times by the [Executor task launch worker-1] thread, which I assume is a worker thread. While debugging, I could see that before the first read, two S3n requests were made specifically for the footer (they carried a Content-Range HTTP header): one to get the size of the footer and one to get the footer itself. My question is: once the footer information was available, why did the [pool-21-thread-1] thread still have to read the entire file? And why did the executor thread make 2 requests to read the S3 file?

2. The Spark UI shows only 670 KB taken as input. Since I was not convinced this was true, I looked at the network activity, and it seems 20+ MB was received. The attached snapshot shows roughly 5+ MB received during the first read and another 15+ MB for the 2 reads after Thread.sleep(1000*10). Due to IDE issues I could not reach the debug point for the last 2 reads by the [pool-21-thread-1] thread, so I am not sure whether only the selected column ("referenceCode") is being read or the entire file; a sketch for inspecting the footer directly follows below. I understand there is overhead from packets at the transport layer, but 20+ MB seems like a lot for just one column.
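To sanity-check the column-chunk sizes, the footer can also be read directly with the parquet-hadoop API that Spark 1.6 bundles (Parquet 1.7.x, org.apache.parquet.*). A minimal sketch, reusing filePath from above and assuming the S3 credentials are already set in the Hadoop Configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Fetch only the footer and print the size of every column chunk it describes.
val footer = ParquetFileReader.readFooter(new Configuration(), new Path(filePath))
for (block <- footer.getBlocks.asScala; col <- block.getColumns.asScala) {
  println(s"${col.getPath} compressed=${col.getTotalSize} uncompressed=${col.getTotalUncompressedSize}")
}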
Upvotes: 2
Views: 1146
Reputation: 2977
After debugging into the application, it turned out that S3N still uses the jets3t library, whereas S3A has a new implementation based on the AWS SDK (HADOOP-10400). Hadoop's NativeS3FileSystem implementation does not support seek (partial/ranged content reads) on S3 files; it downloads the whole file first.
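A minimal sketch of switching the same read over to the S3A connector (assuming the hadoop-aws jar and a matching aws-java-sdk jar are on the classpath; the fs.s3a.* keys are the standard S3A properties and the credential values are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setMaster("local[2]").set("spark.app.name", "parquet-reader-s3a")
val sc = new SparkContext(conf)

// Point Hadoop at the AWS-SDK-based S3A filesystem (credential values are placeholders).
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

val sqlc = new SQLContext(sc)
// Same file as in the question, read through s3a:// instead of s3n://.
val df = sqlc.read
  .parquet("s3a://bucket/trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet")
  .select("referenceCode")
println(df.collect().length)

Since S3A supports seek, the footer and the projected column chunks can be fetched with ranged GETs, so the whole object should no longer need to be downloaded.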
EDIT: This behaviour was not seen on EMR. On EMR, Amazon provides a highly optimized S3 connector, EMRFS, for all S3 schemes, which overrides the connector provided by Hadoop.
Upvotes: 2