Reputation: 2390
I am wondering how to choose the best settings to tune my Spark job.
Basically I am just reading a big CSV file into a DataFrame and counting some string occurrences.
The input file is over 500 GB, and the Spark job seems too slow.
Terminal progress bar:
[Stage1:=======> (4174 + 50) / 18500]
NumberCompletedTasks (4174): completing these takes around one hour.
NumberActiveTasks (50): I believe I can control this with the setting
--conf spark.dynamicAllocation.maxExecutors=50
(I tried different values).
TotalNumberOfTasks (18500): why is this fixed? What does it mean? Is it only related to the file size?
Since I am just reading a CSV with very little logic, how can I optimize the Spark job?
I also tried changing:
--executor-memory 10g
--driver-memory 12g
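For reference, the flags above can be combined into a single spark-submit invocation. This is only a sketch of what the question describes; the script name is a placeholder and the values are the ones being experimented with, not a verified tuning:

```shell
# Hypothetical invocation combining the settings mentioned in the question.
# count_strings.py is a placeholder for the actual job script.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --executor-memory 10g \
  --driver-memory 12g \
  count_strings.py
```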
Upvotes: 1
Views: 1500
Reputation: 470
Number of partitions = number of tasks. If you have 18500 partitions, then Spark will run 18500 tasks to process them.
Are you just reading the file and filtering it, or do you perform any wide transformations? If you perform a wide transformation, then the number of partitions in the resulting RDD is controlled by the property "spark.sql.shuffle.partitions". If this is set to 18500, then your resultant RDD will have 18500 partitions and, as a result, 18500 tasks.
Secondly, spark.dynamicAllocation.maxExecutors is the upper bound on the number of executors when dynamic allocation is enabled. From what I can see, you have 5 nodes with 10 executors per node [50 executors in total] and 1 core per executor [if you are running on YARN, 1 core per executor is the default].
To run your job faster: if possible, reduce the number of shuffle partitions via the property spark.sql.shuffle.partitions, and increase the number of cores per executor [5 cores per executor is the recommended value].
Upvotes: 1
Reputation: 1175
The number of tasks depends on the number of partitions of your source RDD. In your case you are reading from HDFS, so the block size decides the number of partitions and thus the number of tasks; it is not based on the number of executors. If you want to increase or decrease the number of tasks, you need to alter the partitioning: when reading, override the HDFS min/max split-size configuration; for an existing RDD, use repartition/coalesce to do the same.
Upvotes: 1