monster

Reputation: 1782

Apache Spark RDD partitioning and join

When I join two RDDs, where is the data actually joined? That is, is the data aggregated on the driver and then shipped back out to the worker nodes, or is one of the nodes randomly selected to "receive" the data? Furthermore, if I call partitionBy on a pair RDD, is the partitioning done by key automatically?

Upvotes: 4

Views: 1610

Answers (1)

Sean Owen

Reputation: 66866

No, the join does not proceed via the driver or any single node. A shuffle happens in which each of many tasks across the executors collects all values (from both parent RDDs) for its subset of keys. Each task then forms the join product key by key as it iterates through its partition. Partitioning is by key. Joining two identically partitioned RDDs is advantageous because it avoids the shuffle entirely.
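
As an illustration, here is a minimal sketch of that last point. The RDD names, the sample data, and the choice of a HashPartitioner with 4 partitions are assumptions made up for the example, not anything from the question; the idea is simply that pre-partitioning both pair RDDs with the same partitioner lets the subsequent join find matching keys already co-located.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val conf = new SparkConf().setAppName("join-example").setMaster("local[*]")
val sc = new SparkContext(conf)

// Two pair RDDs keyed by an integer ID (illustrative data)
val purchases = sc.parallelize(Seq((1, "book"), (2, "pen"), (1, "lamp")))
val names     = sc.parallelize(Seq((1, "alice"), (2, "bob")))

// partitionBy redistributes records so that each key lands in the
// partition chosen by the partitioner's hash of that key
val partitioner = new HashPartitioner(4)
val purchasesByKey = purchases.partitionBy(partitioner).persist()
val namesByKey     = names.partitionBy(partitioner).persist()

// Both RDDs share the same partitioner, so matching keys are already
// co-located and the join does not need another shuffle
val joined = purchasesByKey.join(namesByKey)
joined.collect().foreach(println)
```

Persisting the partitioned RDDs matters here: without persist, the partitioning work may be redone each time the RDD is reused, and the benefit of the shared partitioner is lost.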

Upvotes: 4
