Uday Sagar

Reputation: 480

apache spark - which one encounters fewer memory bottlenecks - reduceByKey or reduceByKeyLocally?

I have looked at the API and found the following documentation for both:

def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]

which merges the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.

def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]

which merges the values for each key using an associative reduce function, but returns the results immediately to the master as a Map. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.

I don't see much difference between the two, except that reduceByKeyLocally returns the results back to the master as a Map.
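
To make that concrete, here is a small untested sketch of how I read the two calls (assuming sc is an existing SparkContext); the only difference I can spot is the return type:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// Transformation: the merged pairs are still an RDD[(String, Int)].
val asRdd = pairs.reduceByKey(_ + _)

// Action: the merged pairs come back as a local scala.collection.Map[String, Int].
val asMap = pairs.reduceByKeyLocally(_ + _)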

Upvotes: 3

Views: 1350

Answers (1)

Vidya

Reputation: 30300

The difference is profound.

With reduceByKey, the pairs are represented as an RDD, which means the data remain distributed across the cluster. This is necessary when you are operating at scale.

With reduceByKeyLocally, the results from all the partitions are sent back to the master and merged into a single Map on that one machine. As with the collect action, which brings everything back to the master as an Array, if you are operating at scale, all of that data will overwhelm a single machine and defeat the purpose of using a distributed data abstraction.
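
As a rough sketch of where each result ends up (assuming sc is an existing SparkContext; the HDFS paths and the tab-separated log format are made up for illustration):

// A large pair RDD, e.g. (userId, bytesServed), spread across many executors.
val logs: org.apache.spark.rdd.RDD[(String, Long)] =
  sc.textFile("hdfs:///logs/access/*").map { line =>
    val fields = line.split("\t")
    (fields(0), fields(1).toLong)
  }

// reduceByKey: the reduced pairs stay partitioned across the cluster,
// so you can keep chaining transformations or write them out in parallel.
val perUser = logs.reduceByKey(_ + _)
perUser.saveAsTextFile("hdfs:///output/bytes-per-user")

// reduceByKeyLocally: like collect(), the entire reduced result is pulled
// back to the master as one in-memory Map and must fit on that machine.
val perUserLocal: scala.collection.Map[String, Long] = logs.reduceByKeyLocally(_ + _)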

Upvotes: 7
