Reputation: 1244
I am coming from R, am new to SparkR, and am trying to split a SparkDataFrame column of JSON strings into separate columns. The columns in the SparkDataFrame are arrays with a schema like this:
> printSchema(tst)
root
 |-- FromStation: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ToStation: array (nullable = true)
 |    |-- element: string (containsNull = true)
If I look at the data in the viewer with View(head(tst$FromStation)), I can see that each row of the SparkDataFrame's FromStation column has a form like this:
list("{\"Code\":\"ABCDE\",\"Name\":\"StationA\"}", "{\"Code\":\"WXYZP\",\"Name\":\"StationB\"}", "{...
where the ... indicates that the pattern repeats an unknown number of times.
My Question
How do I extract this information and put it in a flat dataframe? Ideally, I would like to make a FromStationCode and a FromStationName column for each observation in the nested array column. I have tried various combinations of explode and getItem (roughly along the lines of the sketch below), but to no avail: I keep getting a data type mismatch error. I've searched for examples of other people facing this challenge in Spark, but SparkR examples are scarcer, so I'm hoping someone with more experience using Spark/SparkR can provide some insight.
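For reference, the sort of thing I have been attempting looks roughly like this (illustrative only, not my exact code):

# explode() unnests the array so each JSON string gets its own row;
# the exploded column comes back named "col"
exploded <- select(tst, explode(tst$FromStation))
head(exploded)

This still leaves a single column of raw JSON strings, and my attempts to pull Code and Name out of it (e.g. with getItem) are what end in the data type mismatch error.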
Many thanks, nate
Upvotes: 2
Views: 640
Reputation: 212
I guess you need to convert tst into a usual R object first:
df = collect(tst)
Then you can work with df like any other R data.frame, for example as in the sketch below.
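A minimal sketch for the FromStation column, assuming the jsonlite package is available (the Code and Name field names are taken from your sample data):

library(jsonlite)

# df is the data collected above with collect(tst); each entry of
# df$FromStation is a list of JSON strings, so parse each one and bind
# the results into a flat data.frame
from_station <- do.call(rbind, lapply(df$FromStation, function(jsons) {
  parsed <- lapply(jsons, fromJSON)
  data.frame(
    FromStationCode = vapply(parsed, `[[`, character(1), "Code"),
    FromStationName = vapply(parsed, `[[`, character(1), "Name"),
    stringsAsFactors = FALSE
  )
}))
head(from_station)

The same pattern applies to the ToStation column.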
Upvotes: 0