Reputation: 5457
I have two RDDs with different keys:
RDD1: (K1, V1)
RDD2: (K2, V2)
I also have a function that operates on the data in V2 and maps K2 to K1. The result is a new RDD, RDD3: (K1, V2_transformed). My end results are based on some per-key operations combining RDD1's V1 with RDD3's V2_transformed.
It seems to me that it would be beneficial to have RDD3 distributed the same way as RDD1, to avoid a costly join afterwards. Is there a way to specify a priori that I want RDD3 distributed the same way as RDD1?
I work with PySpark.
Upvotes: 1
Views: 257
Reputation: 18750
In PySpark you can use rdd.partitionBy(numPartitions), which hash-partitions by key by default (a custom partitionFunc can be passed as a second argument). If you use the same partitioner, i.e. the same number of partitions and the same partition function, for both RDDs, you should be fine.
Upvotes: 3