Leyth G

Reputation: 1143

How to explode a JSON string to multiple rows in Spark?

I am trying to parse some unstructured JSON data into multiple rows using Spark. I've looked at many examples, but none of them works for this scenario. I'm not sure whether my JSON data is invalid or I am doing something wrong.

I am using Spark 2.4 and here is the source dataframe:

+------+---------------------------------------------------------------------------------------------------------------------------+
|userId|values                                                                                                                     |
+------+---------------------------------------------------------------------------------------------------------------------------+
|1     |[{"mouseCode":0,"target":"TR#test-15110>TD>DIV>SPAN>LABEL"},{"keyCode":96,"target":"INPUT#next-button[value='Save Data']"}]|
|2     |[{"keyCode":53,"target":"TR#box-15110>TD>DIV>SPAN>LABEL"}]                                                                 |
+------+---------------------------------------------------------------------------------------------------------------------------+

I would like the resulting data frame to look like this:

+------+--------------------------------------------------------------+
|userId|values                                                        |
+------+--------------------------------------------------------------+
|1     |{"mouseCode":0,"target":"TR#test-15110>TD>DIV>SPAN>LABEL"}    |
|1     |{"keyCode":96,"target":"INPUT#next-button[value='Save Data']"}|
|2     |{"keyCode":53,"target":"TR#box-15110>TD>DIV>SPAN>LABEL"}      |
+------+--------------------------------------------------------------+

So it's an explode where we don't know how many values might exist. The schema of the source data frame looks like this:

root
|-- userId: integer (nullable = false)
|-- values: string (nullable = true)

df.select(explode($"values").as("explodedValues"))

Since values is a string, this won't work: explode doesn't know how to split the string. So I am thinking I need to convert it to a list/array of those objects first? I attempted to make a UDF that splits on "[},{]", but since that is a regex character class matching any single one of the characters }, comma, or {, it ended up splitting on every single comma instead.

Any help or guidance in the right direction is greatly appreciated. Thank you.

Upvotes: 0

Views: 1225

Answers (1)

M. Alexandru

Reputation: 624

You don't need a UDF to do this; just do it like this:

// ^ and $ anchor the regex so only the outer [ ] are removed,
// not the brackets inside a target like [value='Save Data']
yourDF.withColumn("newValues", explode(split(regexp_replace($"values", "^\\[|\\]$", ""), ",")))
  .drop("values")
  .show(false)
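A caveat on this first version: splitting on a bare comma also splits inside any object that has more than one field. A minimal plain-Scala sketch (no Spark needed, using a simplified sample cell of my own) shows the problem:

```scala
object NaiveSplitDemo {
  // Mirrors the first snippet's string logic on a single cell:
  // strip the outer brackets, then split on every comma.
  def naiveSplit(cell: String): Array[String] =
    cell.replaceAll("^\\[|\\]$", "").split(",")

  def main(args: Array[String]): Unit = {
    val cell = """[{"mouseCode":0,"target":"A"},{"keyCode":96,"target":"B"}]"""
    // The commas inside each object are split points too, so we get
    // 4 broken fragments instead of the 2 complete objects we wanted.
    naiveSplit(cell).foreach(println)
  }
}
```

So the naive split is only safe when every object has a single field.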

Or, to be safe when the objects themselves contain commas, double the curly brackets between objects like this, and then split the string on each "},{" pair:

yourDF.withColumn("newValues", explode(split(regexp_replace(regexp_replace($"values", "^\\[|\\]$", ""), "\\},\\{", "}},{{"), "\\},\\{")))
  .drop("values")
  .show(false)
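The brace-doubling trick can be sketched in plain Scala (again no Spark needed, with a simplified sample cell) to see why it keeps each object intact:

```scala
object BraceSplitDemo {
  // Mirrors the second snippet's string logic on a single cell:
  // strip the outer brackets, double the braces between objects
  // (each "},{" becomes "}},{{"), then split on "},{", which now
  // occurs only between objects -- never inside one.
  def safeSplit(cell: String): Array[String] =
    cell.replaceAll("^\\[|\\]$", "")
        .replaceAll("\\},\\{", "}},{{")
        .split("\\},\\{")

  def main(args: Array[String]): Unit = {
    val cell = """[{"mouseCode":0,"target":"A"},{"keyCode":96,"target":"B"}]"""
    // Each element is a complete, parseable JSON object.
    safeSplit(cell).foreach(println)
  }
}
```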

Avoid UDFs as much as you can, because they are slower than built-in functions.

I hope this helps you.

Upvotes: 1
