RData

Reputation: 969

Extract key value from dataframe in PySpark

I have the dataframe below, which I read from a JSON file.

1                              2                          3                           4
{"todo":["wakeup", "shower"]}  {"todo":["brush", "eat"]}  {"todo":["read", "write"]}  {"todo":["sleep", "snooze"]}

I need my output to have the key and value as shown below. How do I do this? Do I need to create a schema?

ID  todo
1   wakeup, shower
2   brush, eat
3   read, write
4   sleep, snooze

Upvotes: 1

Views: 2820

Answers (2)

wwnde

Reputation: 26676

Use from_json to convert each string to a map, then map_values to pull out the values and explode to expand each element into its own row.

data

df = spark.createDataFrame(
    [('{"todo":"[wakeup, shower]"}', '{"todo":"[brush, eat]"}', '{"todo":"[read, write]"}', '{"todo":"[sleep, snooze]"}')],
    ('value1', 'value2', 'value3', 'value4'))

code

from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate

new = (df
       # parse each JSON string into a map, collect the map values into one array,
       # then explode so every element becomes its own row
       .withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))
       # remove the square brackets left inside the string
       .withColumn('todo', translate('todo', "[]", '')))
new.show(truncate=False)

outcome

+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1                     |value2                 |value3                  |value4                    |todo          |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat    |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write   |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
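If the numeric ID from the question is also needed, one hedged extension of the above (an assumption, not part of the original answer: the column order stands in for IDs 1 to 4) is posexplode, which emits each element's array position alongside the element:

from pyspark.sql.functions import col, posexplode

ids = (df
       # posexplode returns (position, value) pairs; the position can stand in for the ID
       .select(posexplode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))).alias('pos', 'todo'))
       .withColumn('ID', col('pos') + 1)                 # positions 0..3 -> IDs 1..4
       .withColumn('todo', translate('todo', "[]", ''))  # strip brackets as before
       .select('ID', 'todo'))
ids.show()
# +---+--------------+
# | ID|          todo|
# +---+--------------+
# |  1|wakeup, shower|
# |  2|    brush, eat|
# |  3|   read, write|
# |  4| sleep, snooze|
# +---+--------------+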

Upvotes: 1

ZygD

Reputation: 24488

The key-value pair you refer to is a struct: "keys" are struct field names, while "values" are field values.

What you want to do is called unpivoting. One way to do it in PySpark is with stack. The following is a dynamic approach, where you don't need to hard-code the column names.

Input dataframe:

df = spark.createDataFrame(
    [((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
    '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
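As a quick illustration of the struct's key/value layout (a small sketch, not part of the original answer), selecting a field by its name returns that field's value:

# The field name ("key") addresses the field's value inside the struct.
df.selectExpr('`1`.todo').show()
# +----------------+
# |            todo|
# +----------------+
# |[wakeup, shower]|
# +----------------+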

Script:

to_melt = [f"'{c}', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {', '.join(to_melt)}) (ID, todo)")

df.show()
# +---+----------------+
# | ID|            todo|
# +---+----------------+
# |  1|[wakeup, shower]|
# |  2|    [brush, eat]|
# |  3|   [read, write]|
# |  4| [sleep, snooze]|
# +---+----------------+
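The expected output in the question shows todo as a plain comma-separated string rather than an array; a small follow-up sketch (assuming ', ' as the separator) is to collapse the array with concat_ws:

from pyspark.sql.functions import concat_ws

# concat_ws joins the elements of an array column into one string
df.withColumn('todo', concat_ws(', ', 'todo')).show()
# +---+--------------+
# | ID|          todo|
# +---+--------------+
# |  1|wakeup, shower|
# |  2|    brush, eat|
# |  3|   read, write|
# |  4| sleep, snooze|
# +---+--------------+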

Upvotes: 1
