gambiteer

Reputation: 33

Spark JSON nested array to DataFrame

I need to process a JSON file with the following schema:

root
 |-- Header: struct (nullable = true)
 |    |-- Format: string (nullable = true)
 |    |-- Version: struct (nullable = true)
 |    |    |-- vfield: string (nullable = true)
 |-- Payload: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Data: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- Event: struct (nullable = true)
 |    |    |    |-- eventCount: long (nullable = true)
 |    |    |    |-- eventName: string (nullable = true)

When I load it into a DataFrame there is only one row, but that row contains many Data and Event elements in the Payload array. (Each element has either Data or Event, but never both.)

I would like to get all the events so I can perform further operations on them, or perhaps load them into a DB table later. To do that I need all the elements of Payload that have an Event; I don't need the ones that only have Data. Ideally the result would be a DataFrame whose rows contain just the members of Event.

Unfortunately, when I tried something like this:

df.select("Payload.Event") or df.select("Payload").filter(...)

it was still operating at the root level, and since there is only one row in the DataFrame that was not very helpful. How can I filter the internal array and get its elements as a separate DataFrame?

Sample json:

{
    "Header": {
        "Version": {
            "vfield": "0.6"
        },
        "Format": "DEFAULT"
    },
    "Payload": [
        {"Data": [
            [0, 1, 2],
            [5, 6]
        ]},

        {"Event": {
            "eventName" : "event1",
            "eventCount": 123
        }},
        {"Event": {
            "eventName" : "event2",
            "eventCount": 124
        }},
        { "Data": [
            [5,8],
            [1,2,6]
        ] }
    ]        
}    
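
For reference, here is a minimal way to load a file like this (the path is a placeholder; the multiLine option is needed because the JSON object spans multiple lines):

// spark is the SparkSession provided by spark-shell;
// "/path/to/input.json" is a placeholder for the actual file.
// multiLine is required because the JSON object spans multiple lines.
val df = spark.read
  .option("multiLine", "true")
  .json("/path/to/input.json")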

Upvotes: 0

Views: 586

Answers (1)

s.polam

Reputation: 10382

Because Payload is of type array, accessing anything inside it without explode will give you a result of type array.
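
For example, selecting Payload.Event without explode should print a schema roughly like this (an array of Event structs, one per Payload element):

scala> df.select("Payload.Event").printSchema
root
 |-- Event: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- eventCount: long (nullable = true)
 |    |    |-- eventName: string (nullable = true)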

Change df.select("Payload.Event") to df.withColumn("Payload",explode("Payload")).select("Payload.Event")

Check the code below.

scala> df.printSchema
root
 |-- Header: struct (nullable = true)
 |    |-- Format: string (nullable = true)
 |    |-- Version: struct (nullable = true)
 |    |    |-- vfield: string (nullable = true)
 |-- Payload: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Data: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- Event: struct (nullable = true)
 |    |    |    |-- eventCount: long (nullable = true)
 |    |    |    |-- eventName: string (nullable = true)


scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event").printSchema
root
 |-- Event: struct (nullable = true)
 |    |-- eventCount: long (nullable = true)
 |    |-- eventName: string (nullable = true)


scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event.*").printSchema
root
 |-- eventCount: long (nullable = true)
 |-- eventName: string (nullable = true)
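
Note that the Payload elements that contain only Data come out of the explode with a null Event. A minimal sketch of dropping those rows before flattening (events is just an illustrative name):

import org.apache.spark.sql.functions.explode
import spark.implicits._ // for the $"..." column syntax

// Data-only elements yield a null Event struct after the explode,
// so filter them out before flattening the struct into columns.
val events = df
  .withColumn("Payload", explode($"Payload"))
  .filter($"Payload.Event".isNotNull)
  .select("Payload.Event.*")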

Upvotes: 2
