Reputation: 1782
When I join
two RDD
s where is the data actually joined, i.e. is the data aggregated on the driver and then shipped back out to the worker nodes, or is one of the nodes randomly selected to "receive" the data? Furthermore, if I call partition
on a pairRDD
then is the partitioning done by key automatically?
Upvotes: 4
Views: 1610
Reputation: 66866
No, it does not proceed via the driver or any single node. A shuffle happens wherein each of many tasks across executors collects all values (from both parents) for a subset of keys. The tasks form the join product for each key as it is iterated through. Partitioning is by key. Joining two identically-partitioned RDDs is advantageous as you avoid the shuffle.
Upvotes: 4