Reputation: 385
I have a Spark application, which I have modeled as follows:

1. `sc.parallelize` to generate an RDD of filenames. I am trying to control the `numberOfPartitions` here by using `sc.parallelize(filenameArray, sizeOfFilenameArray)` - let's call this the `filenamesRDD`.
2. Transformations to generate an `objectsRDD` from the `filenamesRDD`, and then a `pairRDD` from the `objectsRDD`.
3. `reduceByKey` to obtain counts per key - let's call this RDD the `countsRDD`. Currently, due to a bug, I have `numberOfPartitions` for the `countsRDD` set to 1.
4. `foreachPartition` to persist the `countsRDD` to a DB.
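For concreteness, here is a minimal sketch of the shape of that pipeline. `loadObjects`, `toKey`, and `persistToDB` are hypothetical placeholders for the real logic; everything else follows the steps above:

```scala
// Sketch only: assumes sc (SparkContext), filenameArray, and
// sizeOfFilenameArray are in scope as described in the question.
val filenamesRDD = sc.parallelize(filenameArray, sizeOfFilenameArray)

// Stage 1: narrow transformations, so the input partition count carries through
val objectsRDD = filenamesRDD.flatMap(loadObjects)        // one file -> many objects
val pairRDD    = objectsRDD.map(obj => (toKey(obj), 1L))  // (key, 1) pairs

// Stage 2: reduceByKey shuffles; the second argument is the (buggy) partition count of 1
val countsRDD = pairRDD.reduceByKey(_ + _, 1)

// One DB connection per output partition
countsRDD.foreachPartition { iter =>
  iter.foreach { case (key, count) => persistToDB(key, count) }
}
```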
I have two environments where I am running the application:

- Test: `spark.default.parallelism` = 4
- Prod: `spark.default.parallelism` = 32

As expected, my job executes in two stages:
- Stage 1: `filenamesRDD` -> `objectsRDD` -> `pairRDD`
- Stage 2: `pairRDD` -> `countsRDD` -> `persistToDB`

I am observing that in my Prod environment, the `numberOfTasks` generated for both Stages 1 and 2 doesn't equal the `numberOfPartitions` in the corresponding RDDs. I confirmed the value of `numberOfPartitions` by printing it out (see the snippet after the numbers below). Here is an example with `numberOfFiles` = 100:
Test Environment:

- Stage 1: `numberOfTasks` = 100, `numberOfPartitions` = 100 for `objectsRDD` and `pairRDD`
- Stage 2: `numberOfTasks` = 1, `numberOfPartitions` = 1 for `countsRDD`

Prod Environment:

- Stage 1 (expected): `numberOfTasks` = 100, `numberOfPartitions` = 100 for `objectsRDD` and `pairRDD`
- Stage 1 (observed): `numberOfTasks` = 16, `numberOfPartitions` = 100 for `objectsRDD` and `pairRDD`
- Stage 2 (expected): `numberOfTasks` = 1, `numberOfPartitions` = 1 for `countsRDD`
- Stage 2 (observed): `numberOfTasks` = 16, `numberOfPartitions` = 1 for `countsRDD`
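For reference, here is roughly how I printed the partition counts (`getNumPartitions` is the standard RDD method):

```scala
// Printed after each transformation to confirm numberOfPartitions
println(s"objectsRDD partitions: ${objectsRDD.getNumPartitions}")
println(s"pairRDD partitions:    ${pairRDD.getNumPartitions}")
println(s"countsRDD partitions:  ${countsRDD.getNumPartitions}")
// equivalently: rdd.partitions.length
```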
I have read through a lot of material, and nowhere have I seen instances or explanations where `numberOfPartitions` != `numberOfTasks`. Could someone help me figure out what is going on?
Upvotes: 1
Views: 1225
Reputation: 338
It is possible that the two environments have different configuration values. You can view the configuration under the "Environment" tab of the Spark History Server UI. I suggest comparing the Test and Prod environment settings.
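If the UI is inconvenient, a quick way to dump the effective configuration from each environment so the two can be diffed (run in each environment's shell or driver; `sc` is the SparkContext):

```scala
// Dump the effective Spark configuration, sorted for easy diffing
sc.getConf.getAll
  .sortBy(_._1)
  .foreach { case (key, value) => println(s"$key=$value") }

// defaultParallelism reflects what the scheduler will actually use,
// even when spark.default.parallelism is not explicitly set
println(s"defaultParallelism=${sc.defaultParallelism}")
```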
Upvotes: 2