KarenFri

Reputation: 21

Databricks PySpark Explode and Pivot Columns

I've been working on cleaning some data and I've run into some difficulties that I hope this community can help me with.

I have this JSON structure:

{ "date": "04/23/2023"
 ,"eventName": "click"
 ,"UserProperties": [{ "key": "Name"
                      ,"string_value": "Lynn Smith"
                      ,"int_value": null
                     },
                     { "key": "Age"
                      ,"string_value": null
                      ,"int_value": 13
                     },
                     { "key": "Gender"
                      ,"string_value": "Female"
                      ,"int_value": null
                     }
                    ]

 ,"EventParams": [{ "key": "Website"
                   ,"string_value": "xxx.com"
                   ,"int_value": null
                  },
                  { "key": "pageTitle"
                   ,"string_value": "zzzz"
                   ,"int_value": null
                  }
                 ]
,"Geography": {"country": "xxx", "city": "yyy"}
}

This is how I want my table to look

date       eventName country city Name       Age Gender Website pageTitle
04/23/2023 click     xxx     yyy  Lynn Smith 13  Female xxx.com zzzz

I managed to get this data into a DataFrame with this structure:

date       eventName country city UserProperties              EventParams
04/23/2023 click     xxx     yyy  [{Name, Lynn Smith, null},  [{Website, xxx.com, null},
                                    {Age, null, 13},            {pageTitle, zzzz, null}]
                                    {Gender, Female, null}]

I was also able to use the explode function, but I had to use it twice (once for UserProperties and once for EventParams). This was the result (a rough sketch of that double explode follows the table), but I don't know how to turn it into the format that I want.

date       eventName country city upKey  upValue    epKey     epValue
04/23/2023 click     xxx     yyy  Name   Lynn Smith Website   xxx.com
04/23/2023 click     xxx     yyy  Name   Lynn Smith pageTitle zzzz
04/23/2023 click     xxx     yyy  Age    13         Website   xxx.com
04/23/2023 click     xxx     yyy  Age    13         pageTitle zzzz
04/23/2023 click     xxx     yyy  Gender Female     Website   xxx.com
04/23/2023 click     xxx     yyy  Gender Female     pageTitle zzzz
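
For reference, that double explode might look roughly like this (a sketch only; df is assumed to be the parsed DataFrame and the column names follow the JSON above):

from pyspark.sql import functions as F

# Sketch: explode UserProperties and EventParams separately, then pick the fields
exploded_df = (df
    .select('date', 'eventName', 'Geography.country', 'Geography.city',
            F.explode('UserProperties').alias('up'), 'EventParams')
    .withColumn('ep', F.explode('EventParams'))
    .select('date', 'eventName', 'country', 'city',
            F.col('up.key').alias('upKey'),
            F.coalesce(F.col('up.string_value'), F.col('up.int_value').cast('string')).alias('upValue'),
            F.col('ep.key').alias('epKey'),
            F.coalesce(F.col('ep.string_value'), F.col('ep.int_value').cast('string')).alias('epValue')))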

The main reason I want this kind of result is that I want to count the number of events by age, for example. For this specific example the count is 1.
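
Once the table is in the wide format shown earlier, that count would be a simple groupBy (sketch; wide_df is a hypothetical name for the desired wide table):

# Sketch: count events per Age once the wide table (wide_df, hypothetical) exists
wide_df.groupBy('Age').count().show()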

Also, I can have multiple rows per date, eventName, country, city; the difference between events can be in any of the keys within UserProperties and EventParams.

Thank you very much in advance.

Upvotes: 1

Views: 328

Answers (3)

Pallav Garg

Reputation: 46

You can create a separate column that contains each key/value pair in a single JSON and then flatten it.
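
One possible reading of this suggestion (a sketch only, not necessarily what the answerer had in mind; df and the column names are taken from the question): collapse each key/value array into a single MapType column with map_from_entries, then flatten it by selecting the keys you need.

from pyspark.sql import functions as F

# Sketch: turn each {key, string_value, int_value} array into one map column,
# then flatten the map by pulling out individual keys.
with_map = df.withColumn(
    "up_map",
    F.map_from_entries(
        F.transform(
            "UserProperties",
            lambda x: F.struct(
                x["key"].alias("key"),
                F.coalesce(x["string_value"], x["int_value"].cast("string")).alias("value"),
            ),
        )
    ),
)

flat = with_map.select(
    "date", "eventName",
    F.col("up_map")["Name"].alias("Name"),
    F.col("up_map")["Age"].alias("Age"),
    F.col("up_map")["Gender"].alias("Gender"),
)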

Upvotes: 0

Mauri

Reputation: 1

If the JSON always has the same structure...

I have an answer in Scala; I think you can easily translate it to Python.

import org.apache.spark.sql.functions.{coalesce, col, explode, first}

// Flatten the Geography struct into country/city columns
val df = first_df.select("*", "Geography.*")
  .drop("Geography")

// Explode EventParams and coalesce string/int values into one column
val df1 = df.withColumn("exploded", explode(col("EventParams")))
  .drop("EventParams")
  .select("*", "exploded.*")
  .withColumn("value", coalesce(col("string_value"), col("int_value")))
  .drop("exploded")

// Pivot the EventParams keys into columns (Website, pageTitle)
val df2 = df1.groupBy("UserProperties", "date", "eventName", "city", "country")
  .pivot("key")
  .agg(first(col("string_value")))

// Explode UserProperties and pivot its keys (Name, Age, Gender)
val df3 = df2.withColumn("exploded", explode(col("UserProperties")))
  .drop("UserProperties")
  .select("*", "exploded.*")
  .drop("exploded")
  .withColumn("value", coalesce(col("string_value"), col("int_value")))
  .groupBy("date", "eventName", "city", "country", "Website", "pageTitle")
  .pivot("key")
  .agg(first(col("value")))

df3.show(false)
+----------+---------+----+-------+-------+---------+---+------+----------+
|date      |eventName|city|country|Website|pageTitle|Age|Gender|Name      |
+----------+---------+----+-------+-------+---------+---+------+----------+
|04/23/2023|click    |yyy |xxx    |xxx.com|zzzz     |13 |Female|Lynn Smith|
+----------+---------+----+-------+-------+---------+---+------+----------+
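
Since the answer suggests translating to Python, a rough PySpark equivalent of the same steps might look like this (an untested sketch; first_df is assumed to hold the parsed JSON):

from pyspark.sql import functions as F

# Rough PySpark equivalent of the Scala steps above (sketch)
df = first_df.select("*", "Geography.*").drop("Geography")

df1 = (df.withColumn("exploded", F.explode(F.col("EventParams")))
         .drop("EventParams")
         .select("*", "exploded.*")
         .withColumn("value", F.coalesce(F.col("string_value"), F.col("int_value")))
         .drop("exploded"))

df2 = (df1.groupBy("UserProperties", "date", "eventName", "city", "country")
          .pivot("key")
          .agg(F.first(F.col("string_value"))))

df3 = (df2.withColumn("exploded", F.explode(F.col("UserProperties")))
          .drop("UserProperties")
          .select("*", "exploded.*")
          .drop("exploded")
          .withColumn("value", F.coalesce(F.col("string_value"), F.col("int_value")))
          .groupBy("date", "eventName", "city", "country", "Website", "pageTitle")
          .pivot("key")
          .agg(F.first(F.col("value"))))

df3.show(truncate=False)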

Upvotes: 0

KarenFri

Reputation: 21

This is how I solved this issue:

from pyspark.sql import functions as F
from pyspark.sql.functions import col, transform, coalesce, map_from_arrays

# Flatten Geography and build parallel key/value arrays for each array column
plain_df = df.select(
      F.col('date')
     ,F.col('eventName')
     ,F.col('Geography.country')
     ,F.col('Geography.city')
     ,transform(col('UserProperties'), lambda x: x['key']).alias('UserProperties_keys')
     ,transform(col('UserProperties'), lambda x: coalesce(  x['string_value']
                                                          , x['int_value']
                                                         ).cast('string')).alias('UserProperties_values')
     ,transform(col('EventParams'), lambda x: x['key']).alias('EventParams_keys')
     ,transform(col('EventParams'), lambda x: coalesce(  x['string_value']
                                                       , x['int_value']
                                                      ).cast('string')).alias('EventParams_values')
    )
        
        
        
# convert the parallel arrays into map columns
plain_df = plain_df.select(
      F.col('date')
     ,F.col('eventName')
     ,F.col('country')
     ,F.col('city')
     ,map_from_arrays(col('UserProperties_keys'), col('UserProperties_values')).alias('map_UserProperties')
     ,map_from_arrays(col('EventParams_keys'), col('EventParams_values')).alias('map_EventParams')
    )
        
# get items from the maps as individual columns
plain_df = plain_df.select(
      F.col('date')
     ,F.col('eventName')
     ,F.col('country')
     ,F.col('city')
     ,col('map_UserProperties').getItem('Name').alias('Name')
     ,col('map_UserProperties').getItem('Age').alias('Age')
     ,col('map_UserProperties').getItem('Gender').alias('Gender')
     ,col('map_EventParams').getItem('Website').alias('Website')
     ,col('map_EventParams').getItem('pageTitle').alias('pageTitle')
    )

I would like to know if there is a dynamic way to obtain the items, because the keys will not appear in a specific order and I can have more or fewer items in the UserProperties or EventParams arrays.
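
One possible direction (a sketch only, assuming the DataFrame still has the map columns from the second step, here called map_df): collect the distinct map keys first, then generate one getItem() per key.

from pyspark.sql import functions as F

# Sketch: collect the distinct map keys, then build one column per key.
up_keys = [r[0] for r in map_df.select(F.explode(F.map_keys('map_UserProperties'))).distinct().collect()]
ep_keys = [r[0] for r in map_df.select(F.explode(F.map_keys('map_EventParams'))).distinct().collect()]

dynamic_df = map_df.select(
    'date', 'eventName', 'country', 'city',
    *[F.col('map_UserProperties').getItem(k).alias(k) for k in up_keys],
    *[F.col('map_EventParams').getItem(k).alias(k) for k in ep_keys],
)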

Any ideas are welcome. Thanks!

Upvotes: 1
