KarenFri

Reputation: 21

Databricks PySpark Explode and Pivot Columns

I've been working on cleaning some data and I've run into some difficulties that I hope this community can help me with.

I have this JSON structure:

{ "date": "04/23/2023"
 ,"eventName": "click"
 ,"UserProperties": [{ "key": "Name"
                      ,"string_value": "Lynn Smith"
                      ,"int_value": null
                     },
                     { "key": "Age"
                      ,"string_value": null
                      ,"int_value": 13
                     },
                     { "key": "Gender"
                      ,"string_value": "Female"
                      ,"int_value": null
                     }
                    ]

 ,"EventParams": [{ "key": "Website"
                   ,"string_value": "xxx.com"
                   ,"int_value": null
                  },
                  { "key": "pageTitle"
                   ,"string_value": "zzzz"
                   ,"int_value": null
                  }
                 ]
,"Geography": {"country": "xxx", "city": "yyy"}
}

This is how I want my table to look

date       eventName country city Name       Age Gender Website pageTitle
04/23/2023 click     xxx     yyy  Lynn Smith 13  Female xxx.com zzzz

I managed to get this data into a DataFrame with this structure:

date       eventName country city UserProperties              EventParams
04/23/2023 click     xxx     yyy  [{Name, Lynn Smith, null},  [{Website, xxx.com, null},
                                    {Age, null, 13},            {pageTitle, zzzz, null}]
                                    {Gender, Female, null}]

I was also able to use the explode function, but I had to use it twice (once for UserProperties and once for EventParams). This was the result (a rough sketch of that double explode follows the table), but I don't know how to turn it into the format that I want.

date       eventName country city upKey  upValue    epKey     epValue
04/23/2023 click     xxx     yyy  Name   Lynn Smith Website   xxx.com
04/23/2023 click     xxx     yyy  Name   Lynn Smith pageTitle zzzz
04/23/2023 click     xxx     yyy  Age    13         Website   xxx.com
04/23/2023 click     xxx     yyy  Age    13         pageTitle zzzz
04/23/2023 click     xxx     yyy  Gender Female     Website   xxx.com
04/23/2023 click     xxx     yyy  Gender Female     pageTitle zzzz
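
For reference, that double explode might look roughly like this (a sketch only; df is assumed to be the parsed DataFrame and the column names follow the JSON above):

from pyspark.sql import functions as F

# Sketch: explode UserProperties and EventParams separately, then pick the fields
exploded_df = (df
    .select('date', 'eventName', 'Geography.country', 'Geography.city',
            F.explode('UserProperties').alias('up'), 'EventParams')
    .withColumn('ep', F.explode('EventParams'))
    .select('date', 'eventName', 'country', 'city',
            F.col('up.key').alias('upKey'),
            F.coalesce(F.col('up.string_value'), F.col('up.int_value').cast('string')).alias('upValue'),
            F.col('ep.key').alias('epKey'),
            F.coalesce(F.col('ep.string_value'), F.col('ep.int_value').cast('string')).alias('epValue')))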

The main reason I want this kind of result is that I want to count the number of events by age, for example. For this specific example the count is 1.
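
Once the table is in the wide format shown earlier, that count would be a simple groupBy (sketch; wide_df is a hypothetical name for the desired wide table):

# Sketch: count events per Age once the wide table (wide_df, hypothetical) exists
wide_df.groupBy('Age').count().show()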

Also, I can have multiple rows per date, eventName, country, city; the difference between events can be in any of the keys within UserProperties and EventParams.

Thank you very much in advance.

Upvotes: 1

Views: 328

Answers (3)

Pallav Garg

Reputation: 46

You can create a separate column that contains each key/value pair in a single JSON and then flatten it.
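
One possible reading of this suggestion (a sketch only, not necessarily what the answerer had in mind; df and the column names are taken from the question): collapse each key/value array into a single MapType column with map_from_entries, then flatten it by selecting the keys you need.

from pyspark.sql import functions as F

# Sketch: turn each {key, string_value, int_value} array into one map column,
# then flatten the map by pulling out individual keys.
with_map = df.withColumn(
    "up_map",
    F.map_from_entries(
        F.transform(
            "UserProperties",
            lambda x: F.struct(
                x["key"].alias("key"),
                F.coalesce(x["string_value"], x["int_value"].cast("string")).alias("value"),
            ),
        )
    ),
)

flat = with_map.select(
    "date", "eventName",
    F.col("up_map")["Name"].alias("Name"),
    F.col("up_map")["Age"].alias("Age"),
    F.col("up_map")["Gender"].alias("Gender"),
)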

Upvotes: 0

Mauri

Reputation: 1

If the JSON always has the same structure...

I have an answer in Scala; I think you can easily translate it to Python.

import org.apache.spark.sql.functions.{coalesce, col, explode, first}

// Flatten the Geography struct into country/city columns
val df = first_df.select("*", "Geography.*")
  .drop("Geography")

// Explode EventParams and coalesce string/int values into one column
val df1 = df.withColumn("exploded", explode(col("EventParams")))
  .drop("EventParams")
  .select("*", "exploded.*")
  .withColumn("value", coalesce(col("string_value"), col("int_value")))
  .drop("exploded")

// Pivot the EventParams keys into columns (Website, pageTitle)
val df2 = df1.groupBy("UserProperties", "date", "eventName", "city", "country")
  .pivot("key")
  .agg(first(col("string_value")))

// Explode UserProperties and pivot its keys (Name, Age, Gender)
val df3 = df2.withColumn("exploded", explode(col("UserProperties")))
  .drop("UserProperties")
  .select("*", "exploded.*")
  .drop("exploded")
  .withColumn("value", coalesce(col("string_value"), col("int_value")))
  .groupBy("date", "eventName", "city", "country", "Website", "pageTitle")
  .pivot("key")
  .agg(first(col("value")))

df3.show(false)
+----------+---------+----+-------+-------+---------+---+------+----------+
|date      |eventName|city|country|Website|pageTitle|Age|Gender|Name      |
+----------+---------+----+-------+-------+---------+---+------+----------+
|04/23/2023|click    |yyy |xxx    |xxx.com|zzzz     |13 |Female|Lynn Smith|
+----------+---------+----+-------+-------+---------+---+------+----------+
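
Since the answer suggests translating to Python, a rough PySpark equivalent of the same steps might look like this (an untested sketch; first_df is assumed to hold the parsed JSON):

from pyspark.sql import functions as F

# Rough PySpark equivalent of the Scala steps above (sketch)
df = first_df.select("*", "Geography.*").drop("Geography")

df1 = (df.withColumn("exploded", F.explode(F.col("EventParams")))
         .drop("EventParams")
         .select("*", "exploded.*")
         .withColumn("value", F.coalesce(F.col("string_value"), F.col("int_value")))
         .drop("exploded"))

df2 = (df1.groupBy("UserProperties", "date", "eventName", "city", "country")
          .pivot("key")
          .agg(F.first(F.col("string_value"))))

df3 = (df2.withColumn("exploded", F.explode(F.col("UserProperties")))
          .drop("UserProperties")
          .select("*", "exploded.*")
          .drop("exploded")
          .withColumn("value", F.coalesce(F.col("string_value"), F.col("int_value")))
          .groupBy("date", "eventName", "city", "country", "Website", "pageTitle")
          .pivot("key")
          .agg(F.first(F.col("value"))))

df3.show(truncate=False)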

Upvotes: 0

KarenFri

Reputation: 21

This is how I solved this issue:

from pyspark.sql import functions as F
from pyspark.sql.functions import col, transform, coalesce, map_from_arrays

# Flatten Geography and build parallel key/value arrays for each array column
plain_df = df.select(
      F.col('date')
     ,F.col('eventName')
     ,F.col('Geography.country')
     ,F.col('Geography.city')
     ,transform(col('UserProperties'), lambda x: x['key']).alias('UserProperties_keys')
     ,transform(col('UserProperties'), lambda x: coalesce(  x['string_value']
                                                          , x['int_value']
                                                         ).cast('string')).alias('UserProperties_values')
     ,transform(col('EventParams'), lambda x: x['key']).alias('EventParams_keys')
     ,transform(col('EventParams'), lambda x: coalesce(  x['string_value']
                                                       , x['int_value']
                                                      ).cast('string')).alias('EventParams_values')
    )
        
        
        
# convert the parallel arrays into map columns
plain_df = plain_df.select(
      F.col('date')
     ,F.col('eventName')
     ,F.col('country')
     ,F.col('city')
     ,map_from_arrays(col('UserProperties_keys'), col('UserProperties_values')).alias('map_UserProperties')
     ,map_from_arrays(col('EventParams_keys'), col('EventParams_values')).alias('map_EventParams')
    )
        
# get items from the maps as individual columns
plain_df = plain_df.select(
      F.col('date')
     ,F.col('eventName')
     ,F.col('country')
     ,F.col('city')
     ,col('map_UserProperties').getItem('Name').alias('Name')
     ,col('map_UserProperties').getItem('Age').alias('Age')
     ,col('map_UserProperties').getItem('Gender').alias('Gender')
     ,col('map_EventParams').getItem('Website').alias('Website')
     ,col('map_EventParams').getItem('pageTitle').alias('pageTitle')
    )

I would like to know if there is a dynamic way to obtain the items, because the keys will not appear in a specific order and I can have more or fewer items in the UserProperties or EventParams arrays.
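
One possible direction (a sketch only, assuming the DataFrame still has the map columns from the second step, here called map_df): collect the distinct map keys first, then generate one getItem() per key.

from pyspark.sql import functions as F

# Sketch: collect the distinct map keys, then build one column per key.
up_keys = [r[0] for r in map_df.select(F.explode(F.map_keys('map_UserProperties'))).distinct().collect()]
ep_keys = [r[0] for r in map_df.select(F.explode(F.map_keys('map_EventParams'))).distinct().collect()]

dynamic_df = map_df.select(
    'date', 'eventName', 'country', 'city',
    *[F.col('map_UserProperties').getItem(k).alias(k) for k in up_keys],
    *[F.col('map_EventParams').getItem(k).alias(k) for k in ep_keys],
)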

Any ideas are welcome. Thanks!

Upvotes: 1
