Rijo Joseph

Reputation: 1405

Spark read performance difference for same-size datasets with different row lengths

I am using Spark SQL to read two different datasets stored in ORC format in S3. The read performance differs hugely even though the datasets are almost the same size.

Dataset 1: 212,000,000 records with 50 columns each, totalling about 15 GB in ORC format in an S3 bucket.

Dataset 2: 29,000,000 records with 150 columns each, also totalling about 15 GB in ORC format in the same S3 bucket.

Dataset 1 takes 2 minutes to read with Spark SQL, while Dataset 2 takes 12 minutes with the same read/count job on the same infrastructure.

Could the length of each row cause such a big difference? Can anyone help me understand the reason behind the huge performance difference in reading these datasets?
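
For reference, the read/count job is essentially the following (a minimal sketch; the s3a:// paths are placeholders, not the real bucket names):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-read-count")
      .getOrCreate()

    // The bucket and prefixes below stand in for the real dataset locations.
    val ds1 = spark.read.orc("s3a://my-bucket/dataset1")   // 212M rows, 50 columns
    val ds2 = spark.read.orc("s3a://my-bucket/dataset2")   // 29M rows, 150 columns

    println(ds1.count())   // finishes in ~2 minutes
    println(ds2.count())   // takes ~12 minutes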

Upvotes: 3

Views: 2249

Answers (1)

stevel

Reputation: 13490

Assuming you are using the s3a: client (and not Amazon EMR and its s3:// client), it comes down to how much seek() work is going on and whether the client is being clever about random IO or not. Essentially: seek() is very expensive over HTTP/1.1 GETs if you have to close an HTTP connection and create a new one. Hadoop 2.8+ adds two features for this: HADOOP-14244 (lazy seek) and HADOOP-13203 (high-performance random IO).

If you have the Hadoop 2.8+ JARs on your classpath, set:

spark.hadoop.fs.s3a.experimental.input.fadvise random

This will hurt performance on non-random IO (reading .gz files and the like), but it is critical for ORC/Parquet IO performance.
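
For example, here is a minimal sketch of setting it through a SparkSession (the path is a placeholder; the option can equally be passed via spark-submit --conf or spark-defaults.conf):

    import org.apache.spark.sql.SparkSession

    // fs.s3a.experimental.input.fadvise is a Hadoop 2.8+ s3a option; the
    // spark.hadoop. prefix forwards it into the Hadoop configuration.
    val spark = SparkSession.builder()
      .appName("orc-random-io")
      .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
      .getOrCreate()

    // Placeholder path; with the random policy the ORC reader's seeks no longer
    // force the s3a client to abort and reopen HTTP connections.
    val df = spark.read.orc("s3a://my-bucket/dataset2")
    println(df.count())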

If you are using Amazon EMR, their s3:// client is closed source; take it up with their support team, I'm afraid.

Upvotes: 3
