Reputation: 385
I have a Spark application, which I have modeled as follows:

1. `sc.parallelize` to generate an RDD of filenames. I am trying to control the `numberOfPartitions` here by using `sc.parallelize(filenameArray, sizeOfFilenameArray)` - let's call this the `filenamesRDD`.
2. Transformations to generate an `objectsRDD` from the `filenamesRDD`, and then a `pairRDD` from the `objectsRDD`.
3. `reduceByKey` to obtain counts per key - let's call this RDD the `countsRDD`. Currently, due to a bug, I have `numberOfPartitions` for the `countsRDD` set to 1.
4. `foreachPartition` to persist the `countsRDD` to a DB.
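For concreteness, here is a minimal sketch of the shape of that pipeline. `loadObjects`, `toKey`, and `persistToDB` are hypothetical placeholders for the real logic; everything else follows the steps above:

```scala
// Sketch only: assumes sc (SparkContext), filenameArray, and
// sizeOfFilenameArray are in scope as described in the question.
val filenamesRDD = sc.parallelize(filenameArray, sizeOfFilenameArray)

// Stage 1: narrow transformations, so the input partition count carries through
val objectsRDD = filenamesRDD.flatMap(loadObjects)        // one file -> many objects
val pairRDD    = objectsRDD.map(obj => (toKey(obj), 1L))  // (key, 1) pairs

// Stage 2: reduceByKey shuffles; the second argument is the (buggy) partition count of 1
val countsRDD = pairRDD.reduceByKey(_ + _, 1)

// One DB connection per output partition
countsRDD.foreachPartition { iter =>
  iter.foreach { case (key, count) => persistToDB(key, count) }
}
```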
I have two environments where I am running the application:

- Test: `spark.default.parallelism` = 4
- Prod: `spark.default.parallelism` = 32

As expected, my job executes in two stages:
- Stage 1: `filenamesRDD` -> `objectsRDD` -> `pairRDD`
- Stage 2: `pairRDD` -> `countsRDD` -> `persistToDB`

I am observing that in my Prod environment, the `numberOfTasks` generated for both Stages 1 and 2 doesn't equal the `numberOfPartitions` in the corresponding RDDs. I confirmed the value of `numberOfPartitions` by printing it out (see the snippet after the numbers below). Here is an example with `numberOfFiles` = 100:
Test Environment:

- Stage 1: `numberOfTasks` = 100, `numberOfPartitions` = 100 for `objectsRDD` and `pairRDD`
- Stage 2: `numberOfTasks` = 1, `numberOfPartitions` = 1 for `countsRDD`

Prod Environment:

- Stage 1 (expected): `numberOfTasks` = 100, `numberOfPartitions` = 100 for `objectsRDD` and `pairRDD`
- Stage 1 (observed): `numberOfTasks` = 16, `numberOfPartitions` = 100 for `objectsRDD` and `pairRDD`
- Stage 2 (expected): `numberOfTasks` = 1, `numberOfPartitions` = 1 for `countsRDD`
- Stage 2 (observed): `numberOfTasks` = 16, `numberOfPartitions` = 1 for `countsRDD`
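For reference, here is roughly how I printed the partition counts (`getNumPartitions` is the standard RDD method):

```scala
// Printed after each transformation to confirm numberOfPartitions
println(s"objectsRDD partitions: ${objectsRDD.getNumPartitions}")
println(s"pairRDD partitions:    ${pairRDD.getNumPartitions}")
println(s"countsRDD partitions:  ${countsRDD.getNumPartitions}")
// equivalently: rdd.partitions.length
```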
I have read through a lot of material, and nowhere have I seen instances or explanations where `numberOfPartitions` != `numberOfTasks`. Could someone help me figure out what is going on?
Upvotes: 1
Views: 1225
Reputation: 338
It is possible that the two environments have different configuration values. You can view the configuration under the "Environment" tab of the Spark History Server UI. I suggest comparing the Test and Prod environment settings.
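If the UI is inconvenient, a quick way to dump the effective configuration from each environment so the two can be diffed (run in each environment's shell or driver; `sc` is the SparkContext):

```scala
// Dump the effective Spark configuration, sorted for easy diffing
sc.getConf.getAll
  .sortBy(_._1)
  .foreach { case (key, value) => println(s"$key=$value") }

// defaultParallelism reflects what the scheduler will actually use,
// even when spark.default.parallelism is not explicitly set
println(s"defaultParallelism=${sc.defaultParallelism}")
```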
Upvotes: 2