I have a PySpark DataFrame in which one of the columns has the following format:
[{key1: value1},{key2:value2}, {key3:value3}, {key4:value4}]
Let's call it ColumnY:

| ColumnY |
| --- |
| [{key1: value1},{key2:value2}, {key3:value3}, {key4:value4}] |
I would like to convert it into columns of the DataFrame, where each column is named keyX and contains valueX, for X = 1..4, as below:
| key1 | key2 | key3 | key4 |
| --- | --- | --- | --- |
| value1 | value2 | value3 | value4 |
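
For reference, a minimal DataFrame with this layout (assuming ColumnY is stored as a plain string column) can be created like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal reproduction: ColumnY holds the unquoted, JSON-like text as a plain string
df = spark.createDataFrame(
    [("[{key1: value1},{key2:value2}, {key3:value3}, {key4:value4}]",)],
    ["ColumnY"],
)
```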
I have tried a few approaches, but they didn't work. Please share any ideas or solutions you may have. Thank you in advance.
That is very badly formatted JSON, without any quotes, but you can still parse it by brute force:
```python
import pyspark.sql.functions as F

df2 = df.selectExpr("""
    explode(
        transform(
            split(ColumnY, ','),
            -- strip brackets, braces and spaces, then parse each 'key:value' pair into a map
            x -> str_to_map(regexp_replace(x, '[\\\\[\\\\{ \\\\]\\\\}]', ''), ' ', ':')
        )
    ) as col
""").select(
    F.explode('col')  # explode each map into 'key' and 'value' columns
).groupBy().pivot('key').agg(F.first('value'))

df2.show()
```

```
+------+------+------+------+
|  key1|  key2|  key3|  key4|
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
```
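
Note that groupBy() with no grouping columns collapses everything into a single output row, which is fine for the one-row example above. If your real DataFrame has multiple rows, you would carry an identifier through the pipeline so each input row becomes one output row. A minimal sketch, assuming a hypothetical id column (not in the original question):

```python
import pyspark.sql.functions as F

# Sketch only: 'id' is a hypothetical per-row identifier; the parsing
# expression is the same as above.
df2 = df.selectExpr("id", """
    explode(
        transform(
            split(ColumnY, ','),
            x -> str_to_map(regexp_replace(x, '[\\\\[\\\\{ \\\\]\\\\}]', ''), ' ', ':')
        )
    ) as col
""").select('id', F.explode('col')) \
   .groupBy('id').pivot('key').agg(F.first('value'))
```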