Reputation: 1405
I am using Spark SQL to read two different datasets stored in ORC format in S3, but the read performance differs hugely for datasets of almost the same size.
Dataset 1: 212,000,000 records with 50 columns each, totalling about 15 GB in ORC format in an S3 bucket.
Dataset 2: 29,000,000 records with 150 columns each, totalling about 15 GB in ORC format in the same S3 bucket.
Dataset 1 takes 2 minutes to read with Spark SQL, while Dataset 2 takes 12 minutes with the same read/count job on the same infrastructure.
Could the length of each row cause such a big difference? Can anyone help me understand the reason behind the huge performance gap when reading these datasets?
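Roughly, the read/count job looks like the sketch below (the bucket and dataset paths are placeholders, not the real ones):

// rough sketch of the read/count job; bucket and paths are placeholders
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-read-count")
  .getOrCreate()

val df1 = spark.read.orc("s3a://my-bucket/dataset1/")  // ~212M rows x 50 cols
val df2 = spark.read.orc("s3a://my-bucket/dataset2/")  // ~29M rows x 150 cols

println(df1.count())  // finishes in ~2 minutes
println(df2.count())  // takes ~12 minutes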
Upvotes: 3
Views: 2249
Reputation: 13490
Assuming you are using the s3a: client (and not Amazon EMR and its s3:// client), it comes down to how much seek() work is going on and whether the client is being clever about random IO. Essentially: seek() is very expensive over HTTP/1.1 GETs if you have to close an HTTP connection and create a new one. Hadoop 2.8+ adds two features for this: HADOOP-14244 (lazy seek) and HADOOP-13203 (high-performance random IO).
If you have the Hadoop 2.8+ JARs on your classpath, set:
spark.hadoop.fs.s3a.experimental.input.fadvise random
This will hurt performance on non-random IO (reading .gz files and the like), but it is critical for ORC/Parquet IO performance.
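For example, you can push that setting through to the s3a client when building the Spark session; this is just a sketch, and the app name and path are placeholders:

// sketch: enabling the random-IO policy on the s3a client
// (requires the Hadoop 2.8+ s3a JARs on the classpath; names/paths are placeholders)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-random-io")
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

val df = spark.read.orc("s3a://my-bucket/dataset2/")
println(df.count())

Equivalently, pass it on the command line: spark-submit --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random ...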
If you are using Amazon EMR, their S3 client is closed source; I'm afraid you will have to take it up with their support team.
Upvotes: 3