Reputation: 742
I have a question about some code I have been reading. It refers to "partitions" as "maps" (in the MapReduce sense), using the two terms interchangeably. For example, in

    --total-executor-cores #maps

#maps is described as the number of maps, and for

    var data = sc.textFile(inputFile, nPartitions)

the code comment says "nPartitions is the number of maps". So, conceptually, are they the same?
Upvotes: 1
Views: 2230
Reputation: 2901
To control the partitioning of a specific RDD you can use the repartition or coalesce methods. If you want a default that applies to all RDDs (the "map" side), set:

    sparkConf.set("spark.default.parallelism", s"${number of mappers you want}")

If you want to control the shuffle (the "reduce" side), set:

    sparkConf.set("spark.sql.shuffle.partitions", s"${number of reducers you want}")

The number of cores is the number of cores you assign to the job in the cluster.
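As a minimal sketch, these controls can be combined like this (the application name, input path, and partition counts are made-up example values, and the local master is only there so the sketch runs on its own):

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionControlSketch {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf()
          .setAppName("partition-control-sketch")
          .setMaster("local[*]") // local master just for this standalone sketch
          // default number of partitions for RDD shuffles (the "map" side default)
          .set("spark.default.parallelism", "8")
          // number of partitions Spark SQL uses for its shuffles (the "reduce" side)
          .set("spark.sql.shuffle.partitions", "8")

        val sc = new SparkContext(sparkConf)

        // hypothetical input path (replace with one that exists);
        // textFile's second argument is a minimum partition count
        val data = sc.textFile("hdfs:///tmp/input.txt", 8)

        // repartition does a full shuffle and can increase or decrease the count
        val more = data.repartition(16)
        // coalesce avoids a full shuffle and is normally used to decrease the count
        val fewer = data.coalesce(4)

        println(s"${data.getNumPartitions} -> ${more.getNumPartitions} / ${fewer.getNumPartitions}")

        sc.stop()
      }
    }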
Upvotes: 0
Reputation: 906
You're correct. The number of cores corresponds to the number of tasks you can compute in parallel, and that number is fixed for the job. The number of partitions, however, varies over the course of the job. For each partition there is a task, and each task is processed by a core, so the number of partitions defines the number of tasks.
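As a rough illustration (the local[4] master and the partition counts below are arbitrary example values), the following sketch shows a fixed number of cores while the number of partitions, and therefore tasks, changes between stages:

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionsVsCores {
      def main(args: Array[String]): Unit = {
        // the number of cores is fixed when the job starts (4 here, purely as an example)
        val conf = new SparkConf().setAppName("partitions-vs-cores").setMaster("local[4]")
        val sc = new SparkContext(conf)

        // 12 partitions -> the first stage consists of 12 tasks,
        // executed at most 4 at a time on the 4 available cores
        val rdd = sc.parallelize(1 to 1000, 12)
        println(s"partitions (= tasks) in stage 1: ${rdd.getNumPartitions}")

        // a shuffle can change the partition count, and with it the task count,
        // while the core count stays the same
        val reduced = rdd.map(x => (x % 3, x)).reduceByKey(_ + _, 6)
        println(s"partitions (= tasks) in stage 2: ${reduced.getNumPartitions}")

        sc.stop()
      }
    }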
Upvotes: 1