user1361815

Reputation: 353

Why are the results of RDD.getNumPartitions and RDD.mapPartitions different?

I'm trying to understand why I don't get the same partition count from these two methods:

val rdd: RDD[Int] = sparkSession.sparkContext.parallelize(0 to 9)
println("S1---> " +rdd.getNumPartitions )
val partitionSizes: Array[Int] = rdd.mapPartitions(iter => Iterator(iter.length)).collect()
partitionSizes.foreach((row: Int) => {
  println("S2---> " +row )
})

Here is my result:

S1---> 1

S2---> 10

Why?

Upvotes: 0

Views: 345

Answers (1)

Ged

Reputation: 18098

So, a few things going on here.

Your default parallelism is set to 1 — perhaps a local master with a single core, or `spark.default.parallelism` set to 1. The two numbers actually agree: `S1---> 1` says there is 1 partition, and `S2---> 10` says that single partition holds 10 items. `getNumPartitions` counts partitions, while your `mapPartitions` call counts elements *per* partition.

For comparison, here is the same code in a Databricks notebook, where the default parallelism is 8:

val rdd: RDD[Int] = spark.sparkContext.parallelize(0 to 9)
println("S1---> " +rdd.getNumPartitions )
//S1---> 8
sc.defaultParallelism
//res9: Int = 8, confirms S1 answer

val partitionSizes: Array[Int] = rdd.mapPartitions(iter => Iterator(iter.length)).collect()
//partitionSizes: Array[Int] = Array(1, 1, 1, 2, 1, 1, 1, 2); the 8 partitions and their counts

// Print each partition's size, one per row
partitionSizes.foreach((row: Int) => {
   println("S2---> " +row )
})

// Count of items per partition
S2---> 1
S2---> 1
S2---> 1
S2---> 2
S2---> 1
S2---> 1
S2---> 1
S2---> 2

So, two different quantities are being measured here: the number of partitions (S1) and the count of items per partition (S2). The title's premise is slightly off: the two methods do not disagree, they report different things.
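The same distinction can be mimicked in plain Scala with no Spark at all (a sketch; `numPartitions` here is an illustrative stand-in for `defaultParallelism`, not a Spark setting):

```scala
// Split 10 items into "partitions", then count items per partition,
// mirroring getNumPartitions (S1) vs. mapPartitions(iter => iter.length) (S2).
val data = (0 to 9).toList
val numPartitions = 1 // illustrative: mirrors defaultParallelism = 1
val partitionSize = math.ceil(data.size.toDouble / numPartitions).toInt
val partitions = data.grouped(partitionSize).toList

println("S1---> " + partitions.size)            // number of partitions: 1
partitions.foreach(p => println("S2---> " + p.size)) // items per partition: 10
```

With `numPartitions = 1` this reproduces the asker's output (one partition of 10 items); setting it to 8 would give several partitions of 1–2 items each, as in the notebook run above.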

Upvotes: 2
