How to properly (scalably) read many ORC files into Spark

Question

I'd like to use EMR and Spark to process an AWS S3 inventory report generated in ORC format. The report consists of hundreds of ORC files, and the total size of the data is around 250 GB.

Is there a specific or best-practice way to read all the files into one Dataset? It seems like I can pass the sqlContext.read().orc() method a list of files, but I'm not sure whether this would scale/parallelize properly if I pass it a list of hundreds of files.

What is the best-practice way of doing this? Ultimately, my goal is to have the contents of all the files in one dataset so that I can run a SQL query on it and then call .map on the results for subsequent processing.

Thanks in advance for your suggestions.

Yuriy Bondaruk · Accepted Answer

Just specify the folder where your ORC files are located. Spark will automatically detect all of them and load them into a single DataFrame.

sparkSession.read.orc("s3://bucket/path/to/folder/with/orc/files")
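
For the full flow described in the question (read, query with SQL, then map over the results), a minimal sketch might look like the following. The view name `inventory`, the column names `key` and `size`, and the size threshold are assumptions for illustration; check them against the actual schema of your inventory report with `inventory.printSchema()`.

```scala
import org.apache.spark.sql.SparkSession

object ReadInventory {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-orc-inventory").getOrCreate()
    import spark.implicits._ // provides the encoders needed by .map below

    // Same placeholder path as above: Spark lists the folder and
    // splits the ORC files into tasks that run across the executors.
    val inventory = spark.read.orc("s3://bucket/path/to/folder/with/orc/files")

    // Expose the DataFrame to SQL under a (hypothetical) view name.
    inventory.createOrReplaceTempView("inventory")

    // `key` and `size` are assumed column names, with `size` assumed to be a bigint.
    val large = spark.sql("SELECT key, size FROM inventory WHERE size > 1048576")

    // The map function runs on the executors, not on the driver.
    val labels = large.map(row => s"${row.getAs[String]("key")}: ${row.getAs[Long]("size")} bytes")

    labels.show(10, truncate = false)
    spark.stop()
  }
}
```

Passing an explicit list of files also works, since `orc` accepts multiple paths, but pointing at the folder is simpler and lets Spark do the listing.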

You shouldn't need to worry much about scalability, since Spark handles it based on the default configuration that EMR provides for the selected EC2 instance type. You can experiment with the number of slave nodes and their instance type, though.

Besides that, I would suggest setting maximizeResourceAllocation to true so that executors are configured to use the maximum resources available on each slave node.
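
On EMR this option is normally supplied as a configuration classification when the cluster is created. A sketch of that configuration, assuming you pass it as the cluster's software settings (for example as a JSON file given to the cluster creation command), could be:

```json
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
```

Note that this setting takes effect at cluster creation, so it needs to be in place before the Spark job is submitted.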
