Grizzle

Reputation: 569

Spark: split nested JSON into rows

I am trying to split some fairly complicated nested JSON into a more rationalised format, but I'm struggling to expand a key whose name changes throughout my dataset.

My dataset looks something like this:

{
    "account": {
        "accountID": "test_account",
        "name": "abc123",
        "checks": {
            "abc123": {
                "check1": "pass",
                "check2": "fail",
                "check3": 0
            },
            "xzy7892": {
                "check1": "pass",
                "check2": "fail",
                "check3": 0,
                "result": {
                    "item1": 1,
                    "item2": 2
                }
            },
            "foobar11012387": {
                "check1": "fail",
                "check2": "pass",
                "check3": 0,
                "result": {
                    "item1": 1,
                    "item2": 2
                }
            }
        }
    }
}
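
For reference, a payload shaped like this can be read with Spark's multi-line JSON reader; a minimal sketch, where 'payload.json' is just a placeholder for the real path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The document spans multiple lines, so multiLine=True is needed;
# 'payload.json' stands in for the actual source.
df2 = spark.read.json('payload.json', multiLine=True)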

I have no control over the schema and thus can only work with what I'm given. Essentially, the payload is broken down into various checks, but each check has a unique name (abc123, xzy7892 and foobar11012387 in the sample payload).

The account, accountID and checks keys are straightforward to select from a dataframe.

df2.select(['account.accountID', 'account.checks']).show()
+------------+--------------------+
|   accountID|              checks|
+------------+--------------------+
|test_account|[[pass, fail, 0],...|
+------------+--------------------+

But I can't go any further than that point (i.e. account.checks.abc123.check1). Ultimately I want to rationalise these three checks into their own rows in the dataframe, but I'm not quite sure how to go about it since the check key changes.

| accountID    | check_name     | check1 | check2 | check3 | result |
|--------------|----------------|--------|--------|--------|--------|
| test_account | abc123         | pass   | fail   | 0      | null   |
| test_account | xzy7892        | pass   | fail   | 0      | [1, 2] |
| test_account | foobar11012387 | fail   | pass   | 0      | [1, 2] |

I would want the DF to look similar to the table above (I didn't expand result, but I could go further). I don't know the names of the checks (i.e. abc123, xzy7892) ahead of time and they do change, so perhaps I need to build an array first.
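
For what it's worth, the check names can at least be pulled out of the schema Spark infers rather than hard-coded; a rough sketch, assuming the dataframe is df2 as above:

# The check names live as struct fields under account.checks in the
# inferred schema, so they can be listed without knowing them up front.
check_names = df2.schema['account'].dataType['checks'].dataType.fieldNames()
print(check_names)  # e.g. ['abc123', 'foobar11012387', 'xzy7892'] for the sample payload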

Any thoughts?

Upvotes: 2

Views: 1397

Answers (1)

Ramesh Maharjan

Reputation: 41987

If your input dataframe and schema are as below

+-------------------------------------------------------------------------------------------+
|account                                                                                    |
+-------------------------------------------------------------------------------------------+
|[test_account, [[pass, fail, 0], [fail, pass, 0, [1, 2]], [pass, fail, 0, [1, 2]]], abc123]|
+-------------------------------------------------------------------------------------------+

root
 |-- account: struct (nullable = true)
 |    |-- accountID: string (nullable = true)
 |    |-- checks: struct (nullable = true)
 |    |    |-- abc123: struct (nullable = true)
 |    |    |    |-- check1: string (nullable = true)
 |    |    |    |-- check2: string (nullable = true)
 |    |    |    |-- check3: long (nullable = true)
 |    |    |-- foobar11012387: struct (nullable = true)
 |    |    |    |-- check1: string (nullable = true)
 |    |    |    |-- check2: string (nullable = true)
 |    |    |    |-- check3: long (nullable = true)
 |    |    |    |-- result: struct (nullable = true)
 |    |    |    |    |-- item1: long (nullable = true)
 |    |    |    |    |-- item2: long (nullable = true)
 |    |    |-- xzy7892: struct (nullable = true)
 |    |    |    |-- check1: string (nullable = true)
 |    |    |    |-- check2: string (nullable = true)
 |    |    |    |-- check3: long (nullable = true)
 |    |    |    |-- result: struct (nullable = true)
 |    |    |    |    |-- item1: long (nullable = true)
 |    |    |    |    |-- item2: long (nullable = true)
 |    |-- name: string (nullable = true)

You can use the struct, array and explode functions as below to get your desired output. Each struct in the array collects one check field (check1, check2 or check3) across all three check names, and explode turns the array into one row per check field:

from pyspark.sql import functions as f

checks1 = ['abc123', 'foobar11012387', 'xzy7892']  # the check names
checks2 = ['check1', 'check2', 'check3']           # the fields within each check

# One struct per field in checks2, holding that field's value for every
# check name; exploding the array yields one row per field.
df.select(f.col('account.accountID'),
          f.explode(f.array(*[f.struct([f.col('account.checks.' + y + '.' + x).cast('string').alias(y)
                                        for y in checks1]).alias(x)
                              for x in checks2])).alias('temp')) \
    .select(f.col('accountID'), f.col('temp.*')) \
    .show(truncate=False)

which should give you

+------------+------+--------------+-------+
|accountID   |abc123|foobar11012387|xzy7892|
+------------+------+--------------+-------+
|test_account|pass  |fail          |pass   |
|test_account|fail  |pass          |fail   |
|test_account|0     |0             |0      |
+------------+------+--------------+-------+
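
If you instead want one row per check name with check1, check2 and check3 as columns, as in the table in your question, the same struct/array/explode idea can be turned around. A rough sketch under the schema above; note that result is absent for abc123, so a typed null stands in for it there:

from pyspark.sql import functions as f

check_names = ['abc123', 'foobar11012387', 'xzy7892']

def check_struct(name):
    # Build one struct per check, tagging it with its own name; substitute
    # a typed null for result where that field is missing from the schema.
    fields = df.schema['account'].dataType['checks'].dataType[name].dataType.fieldNames()
    result = f.col('account.checks.' + name + '.result') if 'result' in fields \
        else f.lit(None).cast('struct<item1:bigint,item2:bigint>')
    return f.struct(f.lit(name).alias('check_name'),
                    f.col('account.checks.' + name + '.check1').alias('check1'),
                    f.col('account.checks.' + name + '.check2').alias('check2'),
                    f.col('account.checks.' + name + '.check3').alias('check3'),
                    result.alias('result'))

df.select(f.col('account.accountID'),
          f.explode(f.array(*[check_struct(n) for n in check_names])).alias('temp')) \
    .select('accountID', 'temp.*') \
    .show(truncate=False)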

I hope the answer is helpful

Upvotes: 1
