java_enthu

Reputation: 2327

Random Partitioner behavior on the joined RDD

I am trying to join two data sets, one of type (Id, SalesRecord) and another of type (Id, Name). The first data set is partitioned by a HashPartitioner and the second by a custom partitioner. When I join these RDDs by id and check which partitioner is retained, I randomly see that the joined RDD sometimes reports the custom partitioner and sometimes the HashPartitioner. I also got different partitioner results when changing the number of partitions.

According to the Learning Spark book, rdd1.join(rdd2) retains the partition info from the rdd1.

Here is the code.

val hashPartitionedRDD = cusotmerIDSalesRecord.partitionBy(new HashPartitioner(10))
println("hashPartitionedRDD's partitioner " + hashPartitionedRDD.partitioner) // Seeing Instance of HashParitioner

val customPartitionedRDD = customerIdNamePair1.partitionBy(new CustomerPartitioner)
println("customPartitionedRDD partitioner " + customPartitionedRDD.partitioner) // Seeing instance of CustomPartitioner

val expectedHash = hashPartitionedRDD.join(customPartitionedRDD)
val expectedCustom = customPartitionedRDD.join(hashPartitionedRDD)

println("Expected Hash " + expectedHash.partitioner) // Seeing instance of Custom Partitioner
println("Expected Custom " + expectedCustom.partitioner) //Seeing instance of Custom Partitioner

// To add to this: when I make the number of partitions of both data sets
// equal, I see the reverse results, i.e.
// expectedHash shows a CustomPartitioner instance and
// expectedCustom shows a HashPartitioner instance.

Upvotes: 2

Views: 412

Answers (1)

Mohitt

Reputation: 2977

The join method internally calls the Partitioner.defaultPartitioner() method.

Look at the definition of defaultPartitioner:

def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
    for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
      return r.partitioner.get
    }
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(bySize.head.partitions.size)
    }
  }

If you look closely at this line:

val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse

The for-loop then searches the RDDs in descending order of their number of partitions. So if both RDDs have their own partitioners, the one with the higher number of partitions is chosen.
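To make this concrete, here is a plain-Scala sketch of the selection logic (no Spark required); the labels and partition counts below are hypothetical stand-ins for RDDs:

```scala
// Plain-Scala sketch of defaultPartitioner's ordering logic; each
// (label, numPartitions) pair is a hypothetical stand-in for an RDD.
object PartitionerChoice {
  def pick(rdd: (String, Int), others: (String, Int)*): String = {
    // Same idea as the Spark source: sort by partition count, descending.
    val bySize = (Seq(rdd) ++ others).sortBy(_._2).reverse
    bySize.head._1 // the RDD with the most partitions "wins"
  }

  def main(args: Array[String]): Unit = {
    // With unequal counts, the larger RDD's partitioner is chosen
    // regardless of which side of the join it is on.
    println(pick(("hashPartitioned", 10), ("customPartitioned", 20)))
    println(pick(("customPartitioned", 20), ("hashPartitioned", 10)))
  }
}
```

Both calls print customPartitioned: with 10 vs. 20 partitions, the join order does not matter.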

EDIT: The reverse behavior you raised is quite simple to explain. When both RDDs have the same number of partitions, sortBy (a stable sort) preserves the original order of Seq(rdd) ++ others, and the trailing .reverse then puts the others at the top of the Seq. So the partitioner of the argument RDD is chosen.

(Seq(rdd) ++ others).sortBy(_.partitions.size).reverse

This behavior is explainable, but perhaps not intuitive.
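The tie-breaking can also be seen with plain collections: because sortBy is stable, equal elements keep their original order, and .reverse then places the argument ahead of the receiver. A minimal sketch with hypothetical labels:

```scala
// With equal partition counts, sortBy (a stable sort) preserves the
// original order Seq(self) ++ others; .reverse then puts `others` first.
object TieBreak {
  def winner(self: (String, Int), others: (String, Int)*): String = {
    val bySize = (Seq(self) ++ others).sortBy(_._2).reverse
    bySize.head._1
  }

  def main(args: Array[String]): Unit = {
    // Both have 10 partitions: the argument RDD's partitioner is chosen.
    println(winner(("self", 10), ("argument", 10)))
  }
}
```

This prints argument, matching the "reversed" results observed when both RDDs have the same number of partitions.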

Upvotes: 4
