Mohitt

Reputation: 2977

Spark: Force two RDD[Key, Value] with co-located partitions using custom partitioner

I have two RDD[K,V], where K=Long and V=Object. Let's call them rdd1 and rdd2. I have a common custom Partitioner. I am trying to find a way to take a union or a join while avoiding or minimizing data movement.

val kafkaRdd1 = /* from kafka sources */
val kafkaRdd2 = /* from kafka sources */

val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))

val rdd3 = rdd1.union(rdd2)          // Without shuffle
val rdd4 = rdd1.leftOuterJoin(rdd2)  // Without shuffle
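
The definition of MyCustomPartitioner is not shown; a minimal sketch of what such a partitioner might look like, assuming it simply buckets Long keys by modulo (the class body below is an assumption, not the actual code):

import org.apache.spark.Partitioner

// Hypothetical stand-in for MyCustomPartitioner: buckets Long keys
// by non-negative modulo over a fixed number of partitions.
class MyCustomPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case k: Long =>
      val mod = (k % numPartitions).toInt
      if (mod < 0) mod + numPartitions else mod
    case _ => 0
  }

  // Spark compares partitioners for equality to decide whether a
  // shuffle can be skipped, so equals/hashCode should be overridden.
  override def equals(other: Any): Boolean = other match {
    case p: MyCustomPartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}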

Is it safe to assume (or is there a way to enforce) that the nth partition of both rdd1 and rdd2 ends up on the same worker node?

Upvotes: 5

Views: 1497

Answers (1)

zero323

Reputation: 330193

It is not possible to enforce* colocation in Spark, but the method you use will minimize data movement. When a PartitionerAwareUnionRDD is created, the input RDDs are analyzed to choose optimal output locations based on the number of records per location. See the getPreferredLocations method for details.
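
You can observe the partitioner-aware path directly (a sketch, using rdd1 and rdd2 from the question): when both inputs report the same partitioner, union preserves it and merges partitions pairwise instead of concatenating them.

// Sketch: with a shared partitioner, union stays partitioner-aware.
val unioned = rdd1.union(rdd2)
println(unioned.partitioner)       // Some(MyCustomPartitioner@...), not None
println(unioned.getNumPartitions)  // 24, not 48: nth partitions are merged

Note that this path is only taken when the two partitioner instances compare equal, which is why MyCustomPartitioner should override equals and hashCode (or the same instance should be passed to both partitionBy calls).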


* According to High Performance Spark

Two RDDs will be colocated if they have the same partitioner and were shuffled as part of the same action.
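
The same applies to the join (a sketch; kafkaRdd1 and kafkaRdd2 are the question's source RDDs, while part, left, and right are names introduced here): reusing a single partitioner instance guarantees that both sides compare equal, and the join then reuses that partitioner instead of adding another shuffle.

// Reuse one partitioner instance so both RDDs carry equal partitioners.
val part  = new MyCustomPartitioner(24)
val left  = kafkaRdd1.partitionBy(part)
val right = kafkaRdd2.partitionBy(part)

val joined = left.leftOuterJoin(right)  // inherits `part`, no extra shuffle
println(joined.toDebugString)           // lineage shows only the two
                                        // partitionBy shuffles, none for the join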

Upvotes: 8
