Reputation: 76
I am trying to understand what happens when we run the collectAsMap() function in Spark. As per the PySpark docs, it says:
collectAsMap(self) Return the key-value pairs in this RDD to the master as a dictionary.
and for core Spark it says:
def collectAsMap(): Map[K, V] Return the key-value pairs in this RDD to the master as a Map.
When I try to run some sample code in PySpark for a List, I get this result:
and for Scala I get this result:
I am a little confused about why it is not returning all the elements in the List. Can somebody help me understand why I am only getting some of the pairs back in this scenario?
Thanks.
Upvotes: 1
Views: 13140
Reputation: 186
collectAsMap will return the results of a paired RDD as a Map collection. And since it is returning a Map collection, you will only get one pair per key; when several pairs share the same key, only one of them is kept.
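For example, a minimal Scala sketch (assuming a SparkContext named sc, as in the spark-shell; the data is made up for illustration):

```scala
// A pair RDD that contains a duplicate key
val rdd = sc.parallelize(List((1, "a"), (1, "b"), (2, "c")))

// Only one value per key survives in the returned Map
rdd.collectAsMap()   // e.g. Map(2 -> "c", 1 -> "b")
```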
Upvotes: 1
Reputation: 29646
The semantics of collectAsMap are identical between the Scala and Python APIs, so I'll look at the former WLOG. The documentation for PairRDDFunctions.collectAsMap explicitly states:
Warning: this doesn't return a multimap (so if you have multiple values to the same key, only one value per key is preserved in the map returned)
In particular, the current implementation inserts the key-value pairs into the resultant map in order and thus only the last two pairs survive in each of your two examples.
If you use collect instead, it will return an Array[(Int,Int)] without losing any of your pairs.
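A rough sketch of the contrast (again assuming a SparkContext sc, as in the spark-shell, and made-up data):

```scala
val pairs = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))

pairs.collect()       // Array((1,10), (1,20), (2,30)) -- every pair is preserved
pairs.collectAsMap()  // only one value per key survives, e.g. Map(2 -> 30, 1 -> 20)
```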
Upvotes: 4