Reputation: 1405
I am using Spark SQL to read two different datasets stored in ORC format in S3, but the read performance differs hugely for datasets of almost the same size.
Dataset 1: 212,000,000 records with 50 columns each, totalling about 15 GB in ORC format in an S3 bucket.
Dataset 2: 29,000,000 records with 150 columns each, totalling about 15 GB in ORC format in the same S3 bucket.
Dataset 1 takes 2 minutes to read with Spark SQL, while Dataset 2 takes 12 minutes with the same read/count job on the same infrastructure.
Could the length of each row cause such a big difference? Can anyone help me understand the reason behind the huge performance gap when reading these datasets?
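Roughly, the read/count job looks like the sketch below (the bucket and dataset paths are placeholders, not the real ones):

// rough sketch of the read/count job; bucket and paths are placeholders
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-read-count")
  .getOrCreate()

val df1 = spark.read.orc("s3a://my-bucket/dataset1/")  // ~212M rows x 50 cols
val df2 = spark.read.orc("s3a://my-bucket/dataset2/")  // ~29M rows x 150 cols

println(df1.count())  // finishes in ~2 minutes
println(df2.count())  // takes ~12 minutes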
Upvotes: 3
Views: 2249
Reputation: 13490
Assuming you are using the s3a: client (and not Amazon EMR and its s3:// client), it comes down to how much seek() work is going on and whether the client is being clever about random IO. Essentially: seek() is very expensive over HTTP/1.1 GETs if you have to close an HTTP connection and create a new one. Hadoop 2.8+ adds two features for this: HADOOP-14244 (lazy seek) and HADOOP-13203 (high-performance random IO).
If you have the Hadoop 2.8+ JARs on your classpath, set:
spark.hadoop.fs.s3a.experimental.input.fadvise random
This will hurt performance on non-random IO (reading .gz files and the like), but it is critical for ORC/Parquet IO performance.
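For example, you can push that setting through to the s3a client when building the Spark session; this is just a sketch, and the app name and path are placeholders:

// sketch: enabling the random-IO policy on the s3a client
// (requires the Hadoop 2.8+ s3a JARs on the classpath; names/paths are placeholders)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-random-io")
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

val df = spark.read.orc("s3a://my-bucket/dataset2/")
println(df.count())

Equivalently, pass it on the command line: spark-submit --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random ...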
If you are using Amazon EMR, their S3 client is closed source; I'm afraid you will have to take it up with their support team.
Upvotes: 3