Ed Ding

Reputation: 41

pyspark map type contains duplicate keys

Could someone help me understand why the map type in pyspark could contain duplicate keys?

An example would be:

# generate a sample dataframe
# the `field` column is an array of struct with value a and value b
# the goal is to create a map from a -> b 

from pyspark.sql import Row

df = spark.createDataFrame([{
    'field': [Row(a=1, b=2), Row(a=1, b=3)],
}])


# above code would generate a dataframe like this
+----------------+
|           field|
+----------------+
|[[1, 2], [1, 3]]|
+----------------+

# with schema
root
 |-- field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: long (nullable = true)

Then I applied map_from_entries on this dataframe, trying to collect unique a -> b pairs. I was expecting the map to contain unique keys, that is {1 -> 3} in this case. However, I'm getting {1 -> 2, 1 -> 3} before collecting. This contradicts the common idea of a map type.

import pyspark.sql.functions as F
df.select(F.map_from_entries("field"))

# the result is
+-----------------------+
|map_from_entries(field)|
+-----------------------+
|       [1 -> 2, 1 -> 3]|
+-----------------------+

I also tried applying F.map_keys() on this field and got [1, 1] as the result. However, when I collect this dataframe, I get the result without duplicate keys:

df.select(F.map_from_entries("field")).collect()

# result
[Row(map_from_entries(field)={1: 3})]
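A plain-Python analogue may help show what happens at collect time: the Spark-side map keeps both entries, but collect() hands the column back as a Python dict, and a dict keeps only the last value seen for each key. This is just a sketch of that conversion, not pyspark's actual internals:

```python
# The map entries as Spark stores them, duplicates included.
entries = [(1, 2), (1, 3)]

# Converting to a Python dict collapses duplicate keys:
# later entries overwrite earlier ones.
as_dict = dict(entries)
print(as_dict)  # {1: 3}
```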

This is causing some unexpected behavior in my Spark job, and I would really appreciate it if someone could help me understand why pyspark behaves this way. Is this a bug or by design?

Upvotes: 4

Views: 7659

Answers (1)

mck

Reputation: 42352

It goes back to how maps are built in Scala; map_from_entries follows the same semantics as Scala's List.toMap: https://www.scala-lang.org/api/2.12.2/scala/collection/immutable/List.html#toMap[T,U]:scala.collection.Map[T,U]

From the Scala docs: "Duplicate keys will be overwritten by later keys: if this is an unordered collection, which key is in the resulting map is undefined."

Therefore the entry 1 -> 3 overwrites 1 -> 2 when the map is materialized as a Python dict. This is designed behaviour, not a bug.
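The same later-keys-win rule exists in Python's own dict construction, which makes for a quick analogue; note that the surviving value depends entirely on the order of the entries, matching the "undefined for unordered collections" caveat in the Scala docs. This is an illustration of the rule, not of Spark's code path:

```python
# Later keys win, as in Scala's List.toMap,
# so the result depends on entry order.
print(dict([(1, 2), (1, 3)]))  # {1: 3}
print(dict([(1, 3), (1, 2)]))  # {1: 2}
```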

Upvotes: 3
