Reputation: 480
I have looked at the API and found the following documentation for both:
def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]
which merges the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]
which merges the values for each key using an associative reduce function, but returns the results immediately to the master as a Map. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
I don't see much difference between the two, except that reduceByKeyLocally returns the results to the master as a Map.
Upvotes: 3
Views: 1350
Reputation: 30300
The difference is profound.
With reduceByKey, the pairs are represented as an RDD, which means the data remain distributed across the cluster. This is necessary when you are operating at scale.
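For example, here is a minimal sketch (assuming a local-mode SparkContext; the dataset and names are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("reduceByKeyDemo").setMaster("local[*]")
val sc   = new SparkContext(conf)

// Small key/value dataset, spread across the cluster's partitions.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))

// reduceByKey returns another RDD: the merged pairs stay partitioned on the
// workers, so subsequent transformations still run in parallel at scale.
val counts: RDD[(String, Int)] = pairs.reduceByKey(_ + _)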
With reduceByKeyLocally, the reduced values from every partition are sent back to the master and merged into a single Map on that one machine. Like the collect action, which brings everything back to the master as an Array, this will completely overwhelm a single machine if you are operating at scale, and it defeats the purpose of using a distributed data abstraction.
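Continuing the same sketch, reduceByKeyLocally pulls everything to one machine:

// reduceByKeyLocally returns a plain scala.collection.Map, materialized
// entirely in the master's memory.
val localCounts: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)

println(localCounts)  // Map(a -> 3, b -> 2)
// Fine for a handful of keys; with millions of distinct keys this Map
// would overwhelm the single machine, just like collect would.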
Upvotes: 7