Reputation: 21
I've been working on cleaning some data and have run into some difficulties that I hope this community can help me with.
I have this JSON structure:
{
  "date": "04/23/2023",
  "eventName": "click",
  "UserProperties": [
    {"key": "Name", "string_value": "Lynn Smith", "int_value": null},
    {"key": "Age", "string_value": null, "int_value": 13},
    {"key": "Gender", "string_value": "Female", "int_value": null}
  ],
  "EventParams": [
    {"key": "Website", "string_value": "xxx.com", "int_value": null},
    {"key": "pageTitle", "string_value": "zzzz", "int_value": null}
  ],
  "Geography": {"country": "xxx", "city": "yyy"}
}
This is how I want my table to look:
date | eventName | country | city | Name | Age | Gender | Website | pageTitle |
---|---|---|---|---|---|---|---|---|
04/23/2023 | click | xxx | yyy | Lynn Smith | 13 | Female | xxx.com | zzzz |
I managed to get this data into a dataframe with this structure:
date | eventName | country | city | UserProperties | EventParams |
---|---|---|---|---|---|
04/23/2023 | click | xxx | yyy | [{"key": "Name", "string_value": "Lynn Smith", "int_value": null}, {"key": "Age", "string_value": null, "int_value": 13}, {"key": "Gender", "string_value": "Female", "int_value": null}] | [{"key": "Website", "string_value": "xxx.com", "int_value": null}, {"key": "pageTitle", "string_value": "zzzz", "int_value": null}] |
I was also able to use the explode function, but I had to use it twice (once for UserProperties and once for EventParams). This was the result, but I don't know how to turn it into the format I want:
date | eventName | country | city | upKey | upValue | epKey | epValue |
---|---|---|---|---|---|---|---|
04/23/2023 | click | xxx | yyy | Name | Lynn Smith | Website | xxx.com |
04/23/2023 | click | xxx | yyy | Name | Lynn Smith | pageTitle | zzzz |
04/23/2023 | click | xxx | yyy | Age | 13 | Website | xxx.com |
04/23/2023 | click | xxx | yyy | Age | 13 | pageTitle | zzzz |
04/23/2023 | click | xxx | yyy | Gender | Female | Website | xxx.com |
04/23/2023 | click | xxx | yyy | Gender | Female | pageTitle | zzzz |
The main reason I want this kind of result is that I want to count the number of events by age, for example. For this specific example the count is 1.
Also, I can have multiple rows per date, eventName, country, and city; the difference between events can be in any of the keys within UserProperties and EventParams.
Thank you very much in advance.
Upvotes: 1
Views: 328
Reputation: 46
You can create a separate column that contains every key/value pair in a single map, and then flatten that map into columns.
Upvotes: 0
Reputation: 1
This works if the JSON always has the same structure.
My answer is in Scala; I think you can easily translate it to Python.
val df = first_df.select("*", "Geography.*")
.drop("Geography")
val df1 = df.withColumn("exploded", explode(col("EventParams")))
.drop("EventParams")
.select("*", "exploded.*")
.withColumn("value", coalesce(col("string_value"), col("int_value")))
.drop("exploded")
val df2 = df1.groupBy("UserProperties", "date", "eventName", "city", "country")
.pivot("key")
.agg(first(col("value")))
val df3 = df2.withColumn("exploded", explode(col("UserProperties")))
.drop("UserProperties")
.select("*", "exploded.*")
.drop("exploded")
.withColumn("value", coalesce(col("string_value"),col("int_value")))
.groupBy("date", "eventName", "city", "country", "Website", "pageTitle")
.pivot("key")
.agg(first(col("value")))
df3.show(false)
+----------+---------+----+-------+-------+---------+---+------+----------+
|date |eventName|city|country|Website|pageTitle|Age|Gender|Name |
+----------+---------+----+-------+-------+---------+---+------+----------+
|04/23/2023|click |yyy |xxx |xxx.com|zzzz |13 |Female|Lynn Smith|
+----------+---------+----+-------+-------+---------+---+------+----------+
Upvotes: 0
Reputation: 21
This is how I solved this issue:
from pyspark.sql import functions as F

plain_df = df.select(
    F.col('date'),
    F.col('eventName'),
    F.col('Geography.country'),
    F.col('Geography.city'),
    F.transform(F.col('UserProperties'), lambda x: x['key']).alias('UserProperties_keys'),
    F.transform(F.col('UserProperties'),
                lambda x: F.coalesce(x['string_value'], x['int_value']).cast('string')
                ).alias('UserProperties_values'),
    F.transform(F.col('EventParams'), lambda x: x['key']).alias('EventParams_keys'),
    F.transform(F.col('EventParams'),
                lambda x: F.coalesce(x['string_value'], x['int_value']).cast('string')
                ).alias('EventParams_values')
)
# convert the key/value arrays to maps
plain_df = plain_df.select(
    F.col('date'),
    F.col('eventName'),
    F.col('country'),
    F.col('city'),
    F.map_from_arrays(F.col('UserProperties_keys'), F.col('UserProperties_values')).alias('map_UserProperties'),
    F.map_from_arrays(F.col('EventParams_keys'), F.col('EventParams_values')).alias('map_EventParams')
)
# get the items from the maps
plain_df = plain_df.select(
    F.col('date'),
    F.col('eventName'),
    F.col('country'),
    F.col('city'),
    F.col('map_UserProperties').getItem('Name').alias('Name'),
    F.col('map_UserProperties').getItem('Age').alias('Age'),
    F.col('map_UserProperties').getItem('Gender').alias('Gender'),
    F.col('map_EventParams').getItem('Website').alias('Website'),
    F.col('map_EventParams').getItem('pageTitle').alias('pageTitle')
)
I would like to know if there is a dynamic way to obtain the items, because the keys will not appear in a specific order and I can have more or fewer items in the UserProperties or EventParams arrays.
Any ideas are welcome. Thanks!
Upvotes: 1