Reputation: 7840
I have data frame as below
Json_column
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{"coordinates":[null,null,null,null,null],"datetime":[1642602463000,1642600679000,1642598301000,1642598232000,1642596529000],"followers_count":[568,5037,76,4325,107]}
{"coordinates":[null,null,null,null,null],"datetime":[1641919643000,1641918112000,1641918082000,1641917719000,1641916830000],"followers_count":[233,63,99,750,186]}
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
need to flatten this data frame as below
+---------------------------------------------------------------------------+-------------------------------------------------------------------------------+--------------------------+
|datetime |coordinates |followers_count |
+---------------------------------------------------------------------------+-------------------------------------------------------------------------------+--------------------------+
|1642602463000 | null | 568 |
|1642600679000 | null | 5037 |
|1642598301000 | null | 76 |
|1642598232000 | null | 4325 |
|1642596529000 | null | 107 |
|1642602463000 | null | 233 |
|1641918112000 | null | 63 |
|1641918082000 | null | 99 |
|1641917719000 | null | 750 |
|1641916830000 | null | 186 |
+---------------------------------------------------------------------------+-------------------------------------------------------------------------------+--------------------------+
I tried this code
df.withColumn("datetime",F.get_json_object(col("Json_column"),"$.datetime")).
withColumn("coordinates",F.get_json_object(col("Json_column"),"$.coordinates")).
withColumn("followers_count",F.get_json_object(col("Json_column"),"$.followers_count"))
.select('datetime','followers_count','coordinates')
but it return array list not flatten data
Upvotes: 0
Views: 73
Reputation: 32640
Use from_json
to parse the json strings into struct type, then arrays_zip
the arrays fields inside the struct and explode the result:
from pyspark.sql import functions as F
result = df.withColumn(
"Json_column",
F.from_json(
"Json_column",
"struct<coordinates:array<string>,datetime:array<long>,followers_count:array<int>>"
)
).withColumn(
"Json_column",
F.arrays_zip("Json_column.datetime", "Json_column.coordinates", "Json_column.followers_count")
).selectExpr(
"inline(Json_column)"
)
result.show()
#+-------------+-----------+---------------+
#|datetime |coordinates|followers_count|
#+-------------+-----------+---------------+
#|1642602463000|null |568 |
#|1642600679000|null |5037 |
#|1642598301000|null |76 |
#|1642598232000|null |4325 |
#|1642596529000|null |107 |
#|1641919643000|null |233 |
#|1641918112000|null |63 |
#|1641918082000|null |99 |
#|1641917719000|null |750 |
#|1641916830000|null |186 |
#+-------------+-----------+---------------+
Upvotes: 1