nate

Reputation: 1244

SparkR, split a column of nested JSON strings into columns

I am coming from R, new to SparkR, and trying to split a SparkDataFrame column of JSON strings into separate columns. The columns in the SparkDataFrame are arrays of strings, with a schema like this:

> printSchema(tst)
root
 |-- FromStation: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ToStation: array (nullable = true)
 |    |-- element: string (containsNull = true)

If I look at the data in the viewer with View(head(tst$FromStation)), I can see that the SparkDataFrame's FromStation column has a form like this in each row:

list("{\"Code\":\"ABCDE\",\"Name\":\"StationA\"}", "{\"Code\":\"WXYZP\",\"Name\":\"StationB\"}", "{...

where the ... indicates that the pattern repeats an unknown number of times.

My Question

How do I extract this information and put it in a flat dataframe? Ideally, I would like to make a FromStationCode and a FromStationName column for each observation in the nested array column. I have tried various combinations of explode and getItem, but to no avail; I keep getting a data type mismatch error. I've searched through examples of other people tackling this challenge in Spark, but SparkR examples are scarcer. I'm hoping someone with more experience using Spark/SparkR can provide some insight.

Many thanks, nate

Upvotes: 2

Views: 640

Answers (1)

Sergio Alyoshkin

Reputation: 212

I guess you need to convert tst into a usual R object:

df = collect(tst)

Then you can operate on df like any other R data.frame.
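For example, a minimal sketch of that approach, assuming the jsonlite package is available and that each row of FromStation collects to a vector of JSON strings as shown in the question:

library(SparkR)
library(jsonlite)

# Bring the SparkDataFrame back to the driver as a plain R data.frame
df <- collect(tst)

# Each entry of df$FromStation holds JSON strings such as
# "{\"Code\":\"ABCDE\",\"Name\":\"StationA\"}"; parse each string with jsonlite
# and bind the Code/Name fields into a two-column data.frame per row.
from_station <- lapply(df$FromStation, function(json_strings) {
  parsed <- lapply(json_strings, jsonlite::fromJSON)
  data.frame(
    FromStationCode = vapply(parsed, `[[`, character(1), "Code"),
    FromStationName = vapply(parsed, `[[`, character(1), "Name"),
    stringsAsFactors = FALSE
  )
})

head(from_station[[1]])

Note that collect() pulls all of the data to the driver, so this only works if the dataset fits in local memory.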

Upvotes: 0
