Reputation: 33
I would need to process a json file with the following schema:
root
|-- Header: struct (nullable = true)
| |-- Format: string (nullable = true)
| |-- Version: struct (nullable = true)
| | |-- vfield: string (nullable = true)
|-- Payload: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Data: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: long (containsNull = true)
| | |-- Event: struct (nullable = true)
| | | |-- eventCount: long (nullable = true)
| | | |-- eventName: string (nullable = true)
When I load it into a DataFrame there is only one Row, but that row contains plenty of data and event elements in the Payload array. (One element has either data or event but never both)
I would like to get all the events so I could perform some further operations on them or maybe loading them later in a DB table etc... In order to do that I will need all the elements of Payload that has Event and I won't need the one that only has "Data". Best would be to have a DataFrame in the end that contains rows with only the members of Event...
Unfortunately when I tried something like this:
df.select("Payload.Event")
or df.select(`Payload`).filter(...)
Then it was still filtering on the root but since there is only one row in the DataFrame that was not very helpful. How can I filter the internal array, and get its elements as a separate Dataframe?
Sample json:
{
"Header": {
"Version": {
"vfield": "0.6"
},
"Format": "DEFAULT"
},
"Payload": [
{"Data": [
[0, 1, 2],
[5, 6]
]},
{"Event": {
"eventName" : "event1",
"eventCount": 123
}},
{"Event": {
"eventName" : "event2",
"eventCount": 124
}},
{ "Data": [
[5,8],
[1,2,6]
] }
]
}
Upvotes: 0
Views: 586
Reputation: 10382
Because Payload
is of type array
, If you access anything without explode
will give you result of type array
Change df.select("Payload.Event")
to df.withColumn("Payload",explode("Payload")).select("Payload.Event")
Check below code.
scala> df.printSchema
root
|-- Header: struct (nullable = true)
| |-- Format: string (nullable = true)
| |-- Version: struct (nullable = true)
| | |-- vfield: string (nullable = true)
|-- Payload: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Data: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: long (containsNull = true)
| | |-- Event: struct (nullable = true)
| | | |-- eventCount: long (nullable = true)
| | | |-- eventName: string (nullable = true)
scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event").printSchema
root
|-- Event: struct (nullable = true)
| |-- eventCount: long (nullable = true)
| |-- eventName: string (nullable = true)
scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event.*").printSchema
root
|-- eventCount: long (nullable = true)
|-- eventName: string (nullable = true)
Upvotes: 2