Reputation: 742
I have a question about some code I have been reading. It refers to "partitions" as "maps" (in the MapReduce sense), using the two terms interchangeably. For example, in

    --total-executor-cores #maps

#maps is described as the number of maps, and for

    var data = sc.textFile(inputFile, nPartitions)

the code comment says "nPartitions is the number of maps". So, conceptually, are they the same?
Upvotes: 1
Views: 2230
Reputation: 2901
To control the partitioning of a specific RDD you can use the repartition or coalesce methods. If you want a default that applies to all RDDs (the "map" side), set:

    sparkConf.set("spark.default.parallelism", s"${number of mappers you want}")

If you want to control the shuffle (the "reduce" side), set:

    sparkConf.set("spark.sql.shuffle.partitions", s"${number of reducers you want}")

The number of cores is the number of cores you assign to the job in the cluster.
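As a minimal sketch, these controls can be combined like this (the application name, input path, and partition counts are made-up example values, and the local master is only there so the sketch runs on its own):

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionControlSketch {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf()
          .setAppName("partition-control-sketch")
          .setMaster("local[*]") // local master just for this standalone sketch
          // default number of partitions for RDD shuffles (the "map" side default)
          .set("spark.default.parallelism", "8")
          // number of partitions Spark SQL uses for its shuffles (the "reduce" side)
          .set("spark.sql.shuffle.partitions", "8")

        val sc = new SparkContext(sparkConf)

        // hypothetical input path (replace with one that exists);
        // textFile's second argument is a minimum partition count
        val data = sc.textFile("hdfs:///tmp/input.txt", 8)

        // repartition does a full shuffle and can increase or decrease the count
        val more = data.repartition(16)
        // coalesce avoids a full shuffle and is normally used to decrease the count
        val fewer = data.coalesce(4)

        println(s"${data.getNumPartitions} -> ${more.getNumPartitions} / ${fewer.getNumPartitions}")

        sc.stop()
      }
    }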
Upvotes: 0
Reputation: 906
You're correct. The number of cores corresponds to the number of tasks you can compute in parallel, and that number is fixed for the job. The number of partitions, however, varies over the course of the job. For each partition there is a task, and each task is processed by a core, so the number of partitions defines the number of tasks.
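As a rough illustration (the local[4] master and the partition counts below are arbitrary example values), the following sketch shows a fixed number of cores while the number of partitions, and therefore tasks, changes between stages:

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionsVsCores {
      def main(args: Array[String]): Unit = {
        // the number of cores is fixed when the job starts (4 here, purely as an example)
        val conf = new SparkConf().setAppName("partitions-vs-cores").setMaster("local[4]")
        val sc = new SparkContext(conf)

        // 12 partitions -> the first stage consists of 12 tasks,
        // executed at most 4 at a time on the 4 available cores
        val rdd = sc.parallelize(1 to 1000, 12)
        println(s"partitions (= tasks) in stage 1: ${rdd.getNumPartitions}")

        // a shuffle can change the partition count, and with it the task count,
        // while the core count stays the same
        val reduced = rdd.map(x => (x % 3, x)).reduceByKey(_ + _, 6)
        println(s"partitions (= tasks) in stage 2: ${reduced.getNumPartitions}")

        sc.stop()
      }
    }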
Upvotes: 1