Reputation: 385
I have 3 folders in HDFS containing CSV files in 3 different schemas. All 3 are huge (several GBs). I want to read the files in parallel and process their rows in parallel. How do I accomplish this on a YARN cluster using Spark?
Upvotes: 2
Views: 4222
Reputation: 91
I encountered a similar situation recently.
You can pass a list of CSV paths to the Spark read API, e.g. spark.read.csv(input_file_paths)
(source). This will load all the files into a single DataFrame, and all the transformations you eventually perform on it will be executed in parallel by multiple executors, depending on your Spark config.
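A minimal sketch of this approach in Scala, assuming hypothetical HDFS folder paths (the paths and the header option below are placeholders, not taken from the question):

```scala
import org.apache.spark.sql.SparkSession

object ReadAllCsvs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-csvs").getOrCreate()

    // Hypothetical HDFS folders -- replace with your actual paths.
    val inputFilePaths = Seq(
      "hdfs:///data/folder1",
      "hdfs:///data/folder2",
      "hdfs:///data/folder3"
    )

    // spark.read.csv accepts multiple paths; the result is one DataFrame
    // whose partitions are processed in parallel by the executors.
    val df = spark.read
      .option("header", "true")
      .csv(inputFilePaths: _*)

    df.show()
    spark.stop()
  }
}
```

Since the three folders use different schemas, merging them into one DataFrame only makes sense if the columns can be reconciled; otherwise one read per folder, as in the other answers, may fit better.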
Upvotes: 1
Reputation: 1518
Assuming you are using Scala, create a parallel collection of your file paths using the HDFS client and the .par convenience method, then map the result onto spark.read and call an action -- voilà, if you have enough resources in the cluster, all the files will be read in parallel. At worst, Spark's job scheduler will shuffle the execution of certain tasks around to minimize wait times.
If you don't have enough workers/executors you won't gain much, but if you do, you can fully exploit those resources without having to wait for each job to finish before you send out the next.
Due to lazy evaluation this may happen anyway, depending on how you work with the data -- but you can force parallel execution of several actions/jobs by using parallel collections or Futures.
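A sketch of that idea, assuming Scala 2.12 and placeholder HDFS paths (the folder names and the use of count() as the action are illustrative only):

```scala
import org.apache.spark.sql.SparkSession

object ParallelCsvJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-csv-jobs").getOrCreate()

    // Hypothetical folder paths -- you could also discover them with the
    // HDFS FileSystem client instead of hard-coding them here.
    val folders = Seq(
      "hdfs:///data/schema_a",
      "hdfs:///data/schema_b",
      "hdfs:///data/schema_c"
    )

    // .par turns the Seq into a parallel collection, so each closure runs on
    // its own driver thread; every count() is a separate Spark job, and the
    // scheduler spreads their tasks over whatever executors are free.
    // (On Scala 2.13+ add the scala-parallel-collections module to use .par.)
    folders.par.foreach { path =>
      val rows = spark.read.option("header", "true").csv(path).count()
      println(s"$path -> $rows rows")
    }

    spark.stop()
  }
}
```

The same effect can be had by wrapping each read-plus-action in a Future and awaiting them together; either way the point is simply to submit the jobs from concurrent driver threads instead of one after another.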
Upvotes: 2
Reputation: 6974
If you want to process all the data separately, you can always write 3 Spark jobs to process them separately and execute them on the cluster in parallel. There are several ways to run all 3 jobs in parallel. The most straightforward is to have an Oozie workflow with 3 parallel sub-workflows.
Now if you want to process the 3 datasets in the same job, you need to read them sequentially. After that you can process the datasets. When you process multiple datasets using Spark operations, Spark parallelizes them for you: the closure of each operation is shipped to the executors, and they all work in parallel.
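A rough sketch of the single-job variant, with placeholder paths and count() standing in for whatever processing you actually need:

```scala
import org.apache.spark.sql.SparkSession

object ThreeDatasetsOneJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("three-datasets").getOrCreate()

    // Hypothetical paths, one per schema -- replace with your own folders.
    val ordersDf    = spark.read.option("header", "true").csv("hdfs:///data/orders")
    val customersDf = spark.read.option("header", "true").csv("hdfs:///data/customers")
    val productsDf  = spark.read.option("header", "true").csv("hdfs:///data/products")

    // The reads are only *declared* one after another; when an action runs,
    // the rows of each DataFrame are processed in parallel across executors.
    println(ordersDf.count())
    println(customersDf.count())
    println(productsDf.count())

    spark.stop()
  }
}
```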
Upvotes: 0
Reputation: 106
What do you mean by "read the files in parallel and process the rows in them in parallel"? Spark deals with your data in parallel by itself, according to your application configuration (num-executors, executor-cores, ...). If you mean "start reading the files at the same time and process them simultaneously", I'm pretty sure you can't explicitly get that. It would require some ability to affect the DAG of your application, but as far as I know the only way to do it is implicitly, by building your data processing as a sequence of transformations/actions. Spark is also designed in such a way that it can execute several stages simultaneously out of the box, if your resource allocation allows.
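For illustration, a minimal sketch of setting that configuration programmatically; the values and the path are made up, and the same settings can instead be passed to spark-submit as --num-executors and --executor-cores:

```scala
import org.apache.spark.sql.SparkSession

object ConfiguredApp {
  def main(args: Array[String]): Unit = {
    // Example values only -- these control how many executors and cores
    // are available, and therefore how much of the reading and row
    // processing actually happens concurrently.
    val spark = SparkSession.builder()
      .appName("configured-app")
      .config("spark.executor.instances", "6")
      .config("spark.executor.cores", "4")
      .getOrCreate()

    val df = spark.read.option("header", "true").csv("hdfs:///data/folder1")
    println(df.count())

    spark.stop()
  }
}
```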
Upvotes: 0