Reputation: 353
I'm trying to understand why I don't get the same result for the partition count between these two methods:
val rdd: RDD[Int] = sparkSession.sparkContext.parallelize(0 to 9)
println("S1---> " +rdd.getNumPartitions )
val partitionSizes: Array[Int] = rdd.mapPartitions(iter => Iterator(iter.length)).collect()
partitionSizes.foreach((row: Int) => {
println("S2---> " +row )
})
Here is my result:
S1---> 1
S2---> 10
Why?
Upvotes: 0
Views: 345
Reputation: 18098
So, a few things going on here.
Your spark.default.parallelism is set to 1 — not sure how, but presumably you are running with few resources. You can see that in your output: S1---> 1 reports 1 partition, and S2---> 10 reports 10 items in that single partition. It adds up.
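For reference, you can reproduce your result by forcing the parallelism down when building the session (a minimal sketch, assuming a local master; on a real cluster the default normally comes from the total executor cores):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a single local core plus an explicit config reproduces
// the questioner's defaultParallelism of 1.
val spark = SparkSession.builder()
  .appName("parallelism-demo")
  .master("local[1]")
  .config("spark.default.parallelism", "1")
  .getOrCreate()

println(spark.sparkContext.defaultParallelism) // 1

// All 10 elements land in the one and only partition
val rdd = spark.sparkContext.parallelize(0 to 9)
println(rdd.getNumPartitions) // 1
```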
Running the same code in a Databricks notebook, take note:
val rdd: RDD[Int] = spark.sparkContext.parallelize(0 to 9)
println("S1---> " +rdd.getNumPartitions )
//S1---> 8
sc.defaultParallelism
//res9: Int = 8, confirms S1 answer
val partitionSizes: Array[Int] = rdd.mapPartitions(iter => Iterator(iter.length)).collect()
//partitionSizes: Array[Int] = Array(1, 1, 1, 2, 1, 1, 1, 2); the 8 partitions and their count
// Print of the above simply, per row
partitionSizes.foreach((row: Int) => {
println("S2---> " +row )
})
// Count of items per partition
S2---> 1
S2---> 1
S2---> 1
S2---> 2
S2---> 1
S2---> 1
S2---> 1
S2---> 2
So, there are two different things being measured here: the number of partitions (S1) and the count of items per partition (S2). The title conflates the two.
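If you want a deterministic partition count regardless of the environment's defaultParallelism, you can pass the number of slices to parallelize explicitly (a sketch; the SparkSession setup is assumed):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("explicit-partitions")
  .master("local[*]") // assumption: local run for illustration
  .getOrCreate()

// Explicitly request 4 partitions instead of relying on defaultParallelism
val rdd: RDD[Int] = spark.sparkContext.parallelize(0 to 9, numSlices = 4)

println("Partitions: " + rdd.getNumPartitions) // 4, regardless of defaultParallelism

// Items per partition: 10 elements spread across the 4 partitions
rdd.mapPartitions(iter => Iterator(iter.size)).collect().foreach(println)
```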
Upvotes: 2