RData

Reputation: 969

Extract key value from dataframe in PySpark

I have the dataframe below, which I read from a JSON file.

1                              2                          3                           4
{"todo":["wakeup", "shower"]}  {"todo":["brush", "eat"]}  {"todo":["read", "write"]}  {"todo":["sleep", "snooze"]}

I need my output to have the key and value as shown below. How do I do this? Do I need to create a schema?

ID  todo
1   wakeup, shower
2   brush, eat
3   read, write
4   sleep, snooze

Upvotes: 1

Views: 2820

Answers (2)

wwnde

Reputation: 26676

Use from_json to convert each string to a map, then map_values to pull out the values and explode to expand each element into its own row.

data

df = spark.createDataFrame(
    [('{"todo":"[wakeup, shower]"}', '{"todo":"[brush, eat]"}', '{"todo":"[read, write]"}', '{"todo":"[sleep, snooze]"}')],
    ('value1', 'value2', 'value3', 'value4'))

code

from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate

new = (df
       # parse each JSON string into a map, collect the map values into one array,
       # then explode so every element becomes its own row
       .withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))
       # remove the square brackets left inside the string
       .withColumn('todo', translate('todo', "[]", '')))
new.show(truncate=False)

outcome

+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1                     |value2                 |value3                  |value4                    |todo          |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat    |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write   |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
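If the numeric ID from the question is also needed, one hedged extension of the above (an assumption, not part of the original answer: the column order stands in for IDs 1 to 4) is posexplode, which emits each element's array position alongside the element:

from pyspark.sql.functions import col, posexplode

ids = (df
       # posexplode returns (position, value) pairs; the position can stand in for the ID
       .select(posexplode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))).alias('pos', 'todo'))
       .withColumn('ID', col('pos') + 1)                 # positions 0..3 -> IDs 1..4
       .withColumn('todo', translate('todo', "[]", ''))  # strip brackets as before
       .select('ID', 'todo'))
ids.show()
# +---+--------------+
# | ID|          todo|
# +---+--------------+
# |  1|wakeup, shower|
# |  2|    brush, eat|
# |  3|   read, write|
# |  4| sleep, snooze|
# +---+--------------+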

Upvotes: 1

ZygD

Reputation: 24488

The key-value pair you refer to is a struct: "keys" are struct field names, while "values" are field values.

What you want to do is called unpivoting. One way to do it in PySpark is with stack. The following is a dynamic approach, where you don't need to hard-code the column names.

Input dataframe:

df = spark.createDataFrame(
    [((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
    '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
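As a quick illustration of the struct's key/value layout (a small sketch, not part of the original answer), selecting a field by its name returns that field's value:

# The field name ("key") addresses the field's value inside the struct.
df.selectExpr('`1`.todo').show()
# +----------------+
# |            todo|
# +----------------+
# |[wakeup, shower]|
# +----------------+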

Script:

to_melt = [f"'{c}', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {', '.join(to_melt)}) (ID, todo)")

df.show()
# +---+----------------+
# | ID|            todo|
# +---+----------------+
# |  1|[wakeup, shower]|
# |  2|    [brush, eat]|
# |  3|   [read, write]|
# |  4| [sleep, snooze]|
# +---+----------------+
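The expected output in the question shows todo as a plain comma-separated string rather than an array; a small follow-up sketch (assuming ', ' as the separator) is to collapse the array with concat_ws:

from pyspark.sql.functions import concat_ws

# concat_ws joins the elements of an array column into one string
df.withColumn('todo', concat_ws(', ', 'todo')).show()
# +---+--------------+
# | ID|          todo|
# +---+--------------+
# |  1|wakeup, shower|
# |  2|    brush, eat|
# |  3|   read, write|
# |  4| sleep, snooze|
# +---+--------------+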

Upvotes: 1
