witchking

Reputation: 385

Number of tasks not equal to number of partitions in Spark

I have a Spark application that does the following

  1. Read files from S3
  2. Group the above data by a 'key' and generate per-key counts
  3. Persist key-value pairs to DB

I have modeled the problem as follows

  1. Obtain the list of files in the driver program and use sc.parallelize to generate an RDD of filenames. I am trying to control the numberOfPartitions here by using sc.parallelize(filenameArray, sizeOfFilenameArray) - let's call this the filenamesRDD
  2. Download contents for each file from S3 in parallel and map to user defined objects - let's call this RDD the objectsRDD
  3. Generate a pairRDD from the objectsRDD
  4. Use reduceByKey to obtain counts per key - let's call this RDD the countsRDD. Currently, due to a bug, I have the numberOfPartitions for the countsRDD set to 1
  5. Use foreachPartition to persist the countsRDD to a DB (a sketch of the full pipeline follows this list)
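Here is a minimal sketch of the pipeline in Scala. The S3 and DB helpers (listFilesFromS3, downloadFromS3, parseObject, persistPartition) and the MyObject class are hypothetical placeholders for my actual code; only the RDD wiring matters here:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountsPerKeyJob {

  // Hypothetical key-bearing object produced from each input line
  case class MyObject(key: String)

  // Hypothetical placeholders for the real S3 and DB code
  def listFilesFromS3(): Array[String] = ???
  def downloadFromS3(filename: String): Seq[String] = ???
  def parseObject(line: String): MyObject = ???
  def persistPartition(rows: Iterator[(String, Long)]): Unit = ???

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("counts-per-key"))

    // Step 1: list files in the driver, then parallelize the filenames
    // asking for one partition per filename
    val filenameArray = listFilesFromS3()
    val filenamesRDD: RDD[String] = sc.parallelize(filenameArray, filenameArray.length)

    // Step 2: download each file's contents from S3 and map lines to objects
    val objectsRDD: RDD[MyObject] =
      filenamesRDD.flatMap(name => downloadFromS3(name).map(parseObject))

    // Steps 3 and 4: build (key, 1) pairs and reduce to per-key counts,
    // forcing countsRDD to a single partition (as described above)
    val countsRDD: RDD[(String, Long)] =
      objectsRDD.map(o => (o.key, 1L)).reduceByKey(_ + _, 1)

    // Step 5: persist each partition of countsRDD to the DB
    countsRDD.foreachPartition(iter => persistPartition(iter))

    sc.stop()
  }
}
```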

I am running the application in two environments: Test and Prod.

As expected, my job executes in two stages

I am observing that in my Prod environment, the numberOfTasks generated for both Stages 1 and 2 does not equal the numberOfPartitions of the corresponding RDDs. I confirmed the value of numberOfPartitions by printing it out. Here is an example:

numberOfFiles = 100

Test Environment

Prod Environment
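For completeness, this is how I am printing the partition counts mentioned above (illustrative snippet; filenamesRDD and countsRDD are the RDDs from the steps listed earlier):

```scala
// Print the partition counts that the task counts are compared against
println(s"filenamesRDD partitions = ${filenamesRDD.getNumPartitions}")
println(s"countsRDD partitions    = ${countsRDD.getNumPartitions}")
// Equivalent on older Spark versions: rdd.partitions.length
```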

I have read through a lot of material and have not found any instance or explanation of numberOfPartitions != numberOfTasks. Could someone help me figure out what is going on?

Upvotes: 1

Views: 1225

Answers (1)

chadum

Reputation: 338

It is possible that the two environments have different configuration values. You can view the configuration in the "Environment" tab of the Spark UI / History Server. I suggest comparing the Test and Prod settings there.
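If you prefer to compare them outside the UI, one option (a minimal sketch) is to dump the resolved configuration from the driver in each environment and diff the two outputs; settings such as spark.default.parallelism or speculative execution (spark.speculation, which can launch extra task attempts) are worth checking:

```scala
// Dump every resolved Spark property, sorted by key, so the Test and Prod
// outputs can be compared with a plain text diff
sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) =>
  println(s"$k=$v")
}
```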

Upvotes: 2
