Reputation: 5457
I have two RDDs with different keys:
RDD1: (K1, V1)
RDD2: (K2, V2)
I also have a function that operates on the data in V2 and maps K2 to K1. The result is a new RDD, RDD3: (K1, V2_transformed). My end results are based on some per-key operations combining RDD1's V1 with RDD3's V2_transformed.
It seems to me that it would be beneficial to have RDD3 distributed the same way as RDD1, to avoid a costly join afterwards. Is there a way to specify a priori that I want RDD3 distributed the same way as RDD1?
I work with PySpark.
Upvotes: 1
Views: 257
Reputation: 18750
In PySpark you can use rdd.partitionBy(numPartitions), which hash-partitions by key by default (a custom partitionFunc can be passed as a second argument). If you use the same partitioner, i.e. the same number of partitions and the same partition function, for both RDDs, you should be fine.
Upvotes: 3