Reputation: 41
Could someone help me understand why the map type in pyspark could contain duplicate keys?
An example would be:
# generate a sample dataframe
# the `field` column is an array of structs with fields a and b
# the goal is to create a map from a -> b
from pyspark.sql import Row

df = spark.createDataFrame([{
    'field': [Row(a=1, b=2), Row(a=1, b=3)],
}])
# above code would generate a dataframe like this
+----------------+
| field|
+----------------+
|[[1, 2], [1, 3]]|
+----------------+
# with schema
root
|-- field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: long (nullable = true)
Then I applied map_from_entries to this dataframe, trying to collect unique a -> b pairs. I was expecting the map to contain unique keys, i.e. {1 -> 3} in this case. However, I'm getting {1 -> 2, 1 -> 3} before collecting, which contradicts the usual notion of a map type.
import pyspark.sql.functions as F
df.select(F.map_from_entries("field")).show()
# the result is
+-----------------------+
|map_from_entries(field)|
+-----------------------+
| [1 -> 2, 1 -> 3]|
+-----------------------+
I also tried applying F.map_keys() to this column and got [1, 1] as the result. However, when I collect the dataframe, the result no longer contains duplicate keys:
df.select(F.map_from_entries("field")).collect()
# result
[Row(map_from_entries(field)={1: 3})]
This is causing some unexpected behavior in my spark job, and I would really appreciate it if someone could help me understand why pyspark is behaving this way. Is this a bug or by design?
Upvotes: 4
Views: 7659
Reputation: 42352
It goes back to the implementation of maps in Scala; see the toMap documentation: https://www.scala-lang.org/api/2.12.2/scala/collection/immutable/List.html#toMap[T,U]:scala.collection.Map[T,U]
"Duplicate keys will be overwritten by later keys: if this is an unordered collection, which key is in the resulting map is undefined."
Therefore, when the entries are turned into a map, 1 -> 3 overwrites 1 -> 2. This is the designed behaviour, not a bug.
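A minimal sketch of this "later entry wins" rule, assuming an active SparkSession named spark as in the question: before collecting, the MapType column still holds both entries, but collect() converts it to a Python dict, and a dict built from the same pairs shows the same overwrite behaviour.
import pyspark.sql.functions as F
from pyspark.sql import Row

df = spark.createDataFrame([{'field': [Row(a=1, b=2), Row(a=1, b=3)]}])
mapped = df.select(F.map_from_entries("field").alias("m"))

# before collecting, the map column still holds both entries
mapped.select(F.map_keys("m")).show()    # [1, 1]

# collect() converts the map to a Python dict, where the later entry wins
print(mapped.collect())                  # [Row(m={1: 3})]

# the same "last key wins" rule applies to a plain Python dict built from pairs
print(dict([(1, 2), (1, 3)]))            # {1: 3}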
Upvotes: 3