Reputation: 3599
I am creating an RDD by passing a collection to the SparkContext parallelize method. My question is: why does it give me 8 partitions when I have only 3 records? Am I getting some empty partitions?
scala> val rdd = sc.parallelize(List("surender","raja","kumar"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:40
scala> rdd.partitions.length
res0: Int = 8
scala> rdd.partitions
res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@691,
org.apache.spark.rdd.ParallelCollectionPartition@692,
org.apache.spark.rdd.ParallelCollectionPartition@693,
org.apache.spark.rdd.ParallelCollectionPartition@694,
org.apache.spark.rdd.ParallelCollectionPartition@695,
org.apache.spark.rdd.ParallelCollectionPartition@696,
org.apache.spark.rdd.ParallelCollectionPartition@697,
org.apache.spark.rdd.ParallelCollectionPartition@698)
scala> rdd.getNumPartitions
res2: Int = 8
Upvotes: 0
Views: 84
Reputation: 7928
If you don't provide the number of partitions, it will create as many as defined in spark.default.parallelism, whose value you can check by running sc.defaultParallelism.
Its value depends on where you are running and on the hardware:
According to the documentation (look for spark.default.parallelism), it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
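If you create the SparkContext yourself, you can also set this property explicitly up front. A minimal sketch, assuming local mode (the app name and the value "4" are placeholders, not from the question):

import org.apache.spark.{SparkConf, SparkContext}

// Set spark.default.parallelism before the context is created;
// it then applies to every parallelize call that omits numSlices.
val conf = new SparkConf()
  .setAppName("parallelism-demo")        // hypothetical app name
  .setMaster("local[*]")                 // local mode: one worker thread per core
  .set("spark.default.parallelism", "4")
val sc = new SparkContext(conf)

sc.parallelize(List("surender", "raja", "kumar")).getNumPartitions  // now 4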
You can specify the number of partitions with a second parameter in the parallelize method. For instance:
scala> val rdd = sc.parallelize(List("surender","raja","kumar"), 5)
scala> rdd.partitions.length
res1: Int = 5
scala> sc.defaultParallelism
res2: Int = 4
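And to answer the last part of the question: yes, with 8 partitions and only 3 records, most partitions are empty. A quick way to check is glom, which turns each partition into an array of its elements (a sketch against the original 8-partition RDD):

// Count the records held by each partition; with 3 records spread
// over 8 partitions, most of the counts come back as 0.
val rdd8 = sc.parallelize(List("surender", "raja", "kumar"))
rdd8.glom().map(_.length).collect()  // e.g. Array(0, 0, 1, 0, 0, 1, 0, 1)

Empty partitions are harmless apart from the small overhead of scheduling a task for each of them.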
Upvotes: 1