Reputation: 1405
I have 30 GB of ORC files (24 parts * 1.3 GB each) in S3. I am using Spark to read these ORC files and do some operations on them. From the logs, I observed that even before doing any operation, Spark opens and reads all 24 parts from S3 (taking 12 minutes just to read the files). My concern is that all of these read operations happen only on the driver, and the executors are idle during this time.
Can someone explain why this is happening? Is there any way I can utilize all executors for reading as well?
Does the same apply to Parquet as well?
Thanks in advance.
Upvotes: 2
Views: 959
Reputation: 574
Have you provided the schema of your data?
If not, Spark tries to infer the schema from all the files on the driver, and only then proceeds with the execution.
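For example, you can supply the schema explicitly when reading (a minimal sketch; the column names, types, and bucket path below are placeholders, not your actual data):

val OPTIONS = Map(
import org.apache.spark.sql.types._

// Declaring the schema up front means the driver does not have to open
// every ORC file footer just to infer it (fields here are hypothetical).
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("event_time", TimestampType, nullable = true),
  StructField("payload", StringType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .orc("s3a://your-bucket/path/to/orc/")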
Upvotes: 4
Reputation: 13430
Both ORC and Parquet can check for summary data in the footers of files, and, depending on the S3 client and its configuration, this may result in some very inefficient IO. That may be the cause here.
If you are using the s3a:// connector with the underlying Hadoop 2.8+ JARs, you can tell it to use the random IO needed for maximum performance on columnar data, and tune a few other things.
val OPTIONS = Map(
  "spark.hadoop.fs.s3a.experimental.fadvise" -> "random",      // S3A input policy tuned for seek-heavy columnar reads
  "spark.hadoop.orc.splits.include.file.footer" -> "true",     // ship ORC footer data with the splits
  "spark.hadoop.orc.cache.stripe.details.size" -> "1000",      // cache ORC stripe details
  "spark.hadoop.orc.filterPushdown" -> "true",                 // push predicates down into the ORC reader
  "spark.sql.parquet.mergeSchema" -> "false",                  // skip schema merging across Parquet files
  "spark.sql.parquet.filterPushdown" -> "true"                 // push predicates down into the Parquet reader
)
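As a sketch of how these options could be applied when building the session (the map above is assumed; the app name and bucket path are placeholders):

import org.apache.spark.sql.SparkSession

// Fold the tuning options into the SparkSession builder (sketch only;
// adjust the options and path to your environment).
val spark = OPTIONS.foldLeft(SparkSession.builder().appName("orc-read")) {
  case (builder, (key, value)) => builder.config(key, value)
}.getOrCreate()

val df = spark.read.orc("s3a://your-bucket/path/to/orc/")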
Upvotes: 3