Reputation: 76
I am trying to understand what happens when we run the collectAsMap() function in Spark. As per the PySpark docs, it says:
collectAsMap(self) Return the key-value pairs in this RDD to the master as a dictionary.
and for core Spark it says:
def collectAsMap(): Map[K, V] Return the key-value pairs in this RDD to the master as a Map.
When I try to run some sample code in PySpark for a List, I get this result:
and for Scala I get this result:
I am a little confused about why it is not returning all the elements in the List. Can somebody help me understand why I am only getting some of the pairs back in this scenario?
Thanks.
Upvotes: 1
Views: 13140
Reputation: 186
collectAsMap will return the results of a paired RDD as a Map collection. And since it is returning a Map collection, you will only get one pair per key; when several pairs share the same key, only one of them is kept.
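For example, a minimal Scala sketch (assuming a SparkContext named sc, as in the spark-shell; the data is made up for illustration):

```scala
// A pair RDD that contains a duplicate key
val rdd = sc.parallelize(List((1, "a"), (1, "b"), (2, "c")))

// Only one value per key survives in the returned Map
rdd.collectAsMap()   // e.g. Map(2 -> "c", 1 -> "b")
```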
Upvotes: 1
Reputation: 29646
The semantics of collectAsMap are identical between the Scala and Python APIs, so I'll look at the former WLOG. The documentation for PairRDDFunctions.collectAsMap explicitly states:
Warning: this doesn't return a multimap (so if you have multiple values to the same key, only one value per key is preserved in the map returned)
In particular, the current implementation inserts the key-value pairs into the resultant map in order and thus only the last two pairs survive in each of your two examples.
If you use collect instead, it will return an Array[(Int,Int)] without losing any of your pairs.
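A rough sketch of the contrast (again assuming a SparkContext sc, as in the spark-shell, and made-up data):

```scala
val pairs = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))

pairs.collect()       // Array((1,10), (1,20), (2,30)) -- every pair is preserved
pairs.collectAsMap()  // only one value per key survives, e.g. Map(2 -> 30, 1 -> 20)
```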
Upvotes: 4