Yuchen

Reputation: 33116

Flatten Spark Dataframe column of map/dictionary into multiple columns

We have a DataFrame that looks like this:

DataFrame[event: string, properties: map<string,string>]

Notice that there are two columns: event and properties. How do we split or flatten the properties column into multiple columns based on the key values in the map?


I notice I can do something like this:

from pyspark.sql.functions import col

newDf = df.withColumn("foo", col("properties")["foo"])

which produces a DataFrame of

DataFrame[event: string, properties: map<string,string>, foo: string]

But then I would have to do this for every key, one by one. Is there a way to do them all automatically? For example, if the keys in properties are foo, bar, and baz, can we flatten the map into:

DataFrame[event: string, foo: string, bar: string, baz: string]

Upvotes: 3

Views: 4544

Answers (1)

Mariusz

Reputation: 13946

You can use the explode() function: it flattens the map by creating two additional columns, key and value, one row per map entry:

>>> df.printSchema()
root
 |-- event: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

>>> df.select('event', explode('properties')).printSchema()
root
 |-- event: string (nullable = true)
 |-- key: string (nullable = false)
 |-- value: string (nullable = true)
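
For instance, on a made-up toy DataFrame (a minimal sketch assuming a local SparkSession; the data and app name are illustrative only):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode

    spark = SparkSession.builder.master("local[1]").appName("explode-demo").getOrCreate()

    df = spark.createDataFrame(
        [("click", {"foo": "1", "bar": "2"})],
        ["event", "properties"],
    )

    # Each map entry becomes its own row with key/value columns.
    exploded = df.select("event", explode("properties"))
    exploded.show()

Note that this produces one row per key, not one column per key, which is why the pivot step below is still needed.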

To turn those key/value rows back into columns, you can use pivot, provided you have a column with unique values to group by. For example:

from pyspark.sql.functions import explode, first, monotonically_increasing_id

df.withColumn('id', monotonically_increasing_id()) \
    .select('id', 'event', explode('properties')) \
    .groupBy('id', 'event').pivot('key').agg(first('value'))
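
Putting it together, a minimal end-to-end sketch (assuming a local SparkSession and made-up toy data; the final drop('id') is an extra cleanup step to discard the synthetic grouping column):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, first, monotonically_increasing_id

    spark = SparkSession.builder.master("local[1]").appName("flatten-map").getOrCreate()

    df = spark.createDataFrame(
        [("click", {"foo": "1", "bar": "2"}),
         ("view",  {"foo": "3", "baz": "4"})],
        ["event", "properties"],
    )

    flat = (df.withColumn("id", monotonically_increasing_id())
              .select("id", "event", explode("properties"))
              .groupBy("id", "event")
              .pivot("key")
              .agg(first("value"))
              .drop("id"))

    # One column per distinct key (plus event); keys absent from a row come back as null.
    flat.show()

Because pivot scans the data for distinct keys, rows that lack a given key simply get null in that column, matching the flattened schema the question asks for.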

Upvotes: 4
