Alg_D

Reputation: 2390

Tune Spark: set executors and driver memory for reading a large CSV file

I am wondering how to choose the best settings to tune my Spark job. Basically I am just reading a big CSV file into a DataFrame and counting some string occurrences.

The input file is over 500 GB. The Spark job seems too slow.

Terminal progress bar:

[Stage1:=======>                      (4174 + 50) / 18500]

NumberCompletedTasks (4174): reaching this point takes around one hour.

NumberActiveTasks (50): I believe I can control this with the setting --conf spark.dynamicAllocation.maxExecutors=50 (I tried different values).

TotalNumberOfTasks (18500): why is this fixed? What does it mean? Is it only related to the file size? Since I am just reading a CSV with very little logic, how can I optimize the Spark job?

I also tried changing:

 --executor-memory 10g 
 --driver-memory 12g 
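
For context, the job is roughly along these lines (the path and column name below are placeholders and the counting logic is simplified):

    import org.apache.spark.sql.SparkSession

    object CountOccurrences {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("csv-count")
          .getOrCreate()

        // Read the ~500 GB CSV into a DataFrame.
        val df = spark.read
          .option("header", "true")
          .csv("hdfs:///data/big_input.csv")

        // Count how often each value occurs in one string column.
        val counts = df.groupBy("some_column").count()
        counts.show(20)

        spark.stop()
      }
    }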

Upvotes: 1

Views: 1500

Answers (2)

Vinoth Chinnasamy

Reputation: 470

Number of partitions = number of tasks. If you have 18500 partitions, then Spark will run 18500 tasks to process them.

Are you just reading the file and filtering it, or do you perform any wide transformations? If you perform a wide transformation, then the number of partitions in the resulting RDD is controlled by the property spark.sql.shuffle.partitions. If this is set to 18500, then your resulting RDD will have 18500 partitions and, as a result, 18500 tasks.
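
As a quick illustration, you can check the partition counts on both sides of a wide transformation in spark-shell (the path and column name are placeholders, and this assumes adaptive execution is not coalescing the shuffle):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-demo").getOrCreate()

    val df = spark.read.option("header", "true").csv("hdfs:///data/big_input.csv")
    // Partitions of the scan stage, driven by the input file splits.
    println(s"Read partitions:    ${df.rdd.getNumPartitions}")

    // groupBy is a wide transformation, so the stage after the shuffle has
    // spark.sql.shuffle.partitions partitions (200 by default).
    val counts = df.groupBy("some_column").count()
    println(s"Shuffle partitions: ${counts.rdd.getNumPartitions}")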

Secondly, spark.dynamicAllocation.maxExecutors is the upper bound on the number of executors when dynamic allocation is enabled. From what I can see, you have 5 nodes with 10 executors per node (50 executors in total) and 1 core per executor (if you are running on YARN, 1 core per executor is the default).

To run your job faster: if possible, reduce the number of shuffle partitions via the property spark.sql.shuffle.partitions and increase the number of cores per executor (5 cores per executor is the commonly recommended value).
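
A sketch of those two changes, shown through the SparkSession builder (the values are illustrative, not tuned for this exact job; in practice the executor sizing is usually passed to spark-submit, e.g. via --executor-cores):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("csv-count-tuned")
      // Far fewer post-shuffle partitions than 18500 means fewer, larger tasks.
      .config("spark.sql.shuffle.partitions", "2000")
      // 5 cores per executor is the commonly recommended value on YARN.
      .config("spark.executor.cores", "5")
      .getOrCreate()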

Upvotes: 1

SanthoshPrasad

Reputation: 1175

The number of tasks depends on the number of partitions of your source RDD. In your case you are reading from HDFS, so the block/split size decides the number of partitions and hence the number of tasks; it is not based on the number of executors. If you want to increase or decrease the number of tasks, you need to change the number of partitions: when reading, override the min/max split size configuration; for an existing RDD, use repartition/coalesce to do the same.
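
Since the question reads the file through the DataFrame API, the corresponding knob there is spark.sql.files.maxPartitionBytes rather than the Hadoop min/max split-size settings. A sketch (the path and the target partition count are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-sizing")
      // Caps the bytes packed into one input partition for DataFrame file
      // sources (default 128 MB); a larger value means fewer, larger tasks.
      .config("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
      .getOrCreate()

    val df = spark.read.option("header", "true").csv("hdfs:///data/big_input.csv")
    println(s"Read partitions: ${df.rdd.getNumPartitions}")

    // For data that is already loaded, repartition() (with a shuffle) or
    // coalesce() (without one) changes the partition count after the fact.
    val fewer = df.coalesce(2000)
    println(s"After coalesce: ${fewer.rdd.getNumPartitions}")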

Upvotes: 1
