Reputation: 569
I am trying to split some fairly complicated nested JSON into a more rationalised format, but I'm struggling to expand a key whose name changes throughout my dataset.
My dataset looks something like this:
{
"account": {
"accountID": "test_account",
"name": "abc123",
"checks": {
"abc123": {
"check1": "pass",
"check2": "fail",
"check3": 0
},
"xzy7892": {
"check1": "pass",
"check2": "fail",
"check3": 0,
"result": {
"item1": 1,
"item2": 2
}
},
"foobar11012387": {
"check1": "fail",
"check2": "pass",
"check3": 0,
"result": {
"item1": 1,
"item2": 2
}
}
}
}
}
I have no control over the schema, so I can only work with what I'm given. Essentially, the payload is broken down into various checks, but each check has a unique name (abc123, xzy7892 and foobar11012387 in the sample payload).
The account, accountID and checks keys are straightforward to select from a dataframe.
df2.select(['account.accountID', 'account.checks']).show()
+------------+--------------------+
| accountID| checks|
+------------+--------------------+
|test_account|[[pass, fail, 0],...|
+------------+--------------------+
But I can't get any further than that point (i.e. account.checks.abc123.check1). Ultimately I want to rationalise these three checks into their own rows in the dataframe, but I'm not quite sure how to go about it since the check key changes.
| accountID    | check_name     | check1 | check2 | check3 | result |
|--------------|----------------|--------|--------|--------|--------|
| test_account | abc123         | pass   | fail   | 0      | null   |
| test_account | xzy7892        | pass   | fail   | 0      | [1, 2] |
| test_account | foobar11012387 | fail   | pass   | 0      | [1, 2] |
I would want the DF to look similar to the table above (I didn't expand result, but I could go further). I don't know the names of the checks (i.e. abc123, xzy7892) ahead of time and they do change, so perhaps I need to build an array first.
Any thoughts?
Upvotes: 2
Views: 1397
Reputation: 41987
If your input dataframe and schema are as below
+-------------------------------------------------------------------------------------------+
|account |
+-------------------------------------------------------------------------------------------+
|[test_account, [[pass, fail, 0], [fail, pass, 0, [1, 2]], [pass, fail, 0, [1, 2]]], abc123]|
+-------------------------------------------------------------------------------------------+
root
|-- account: struct (nullable = true)
| |-- accountID: string (nullable = true)
| |-- checks: struct (nullable = true)
| | |-- abc123: struct (nullable = true)
| | | |-- check1: string (nullable = true)
| | | |-- check2: string (nullable = true)
| | | |-- check3: long (nullable = true)
| | |-- foobar11012387: struct (nullable = true)
| | | |-- check1: string (nullable = true)
| | | |-- check2: string (nullable = true)
| | | |-- check3: long (nullable = true)
| | | |-- result: struct (nullable = true)
| | | | |-- item1: long (nullable = true)
| | | | |-- item2: long (nullable = true)
| | |-- xzy7892: struct (nullable = true)
| | | |-- check1: string (nullable = true)
| | | |-- check2: string (nullable = true)
| | | |-- check3: long (nullable = true)
| | | |-- result: struct (nullable = true)
| | | | |-- item1: long (nullable = true)
| | | | |-- item2: long (nullable = true)
| |-- name: string (nullable = true)
You can use the struct, array and explode functions as below to get your output:
checks1 = ['abc123', 'foobar11012387', 'xzy7892']
checks2 = ['check1', 'check2', 'check3']
from pyspark.sql import functions as f
df.select(
        f.col('account.accountID'),
        f.explode(f.array(*[
            f.struct([f.col('account.checks.' + y + '.' + x).cast('string').alias(y)
                      for y in checks1]).alias(x)
            for x in checks2
        ])).alias('temp')) \
    .select(f.col('accountID'), f.col('temp.*')) \
    .show(truncate=False)
which should give you
+------------+------+--------------+-------+
|accountID |abc123|foobar11012387|xzy7892|
+------------+------+--------------+-------+
|test_account|pass |fail |pass |
|test_account|fail |pass |fail |
|test_account|0 |0 |0 |
+------------+------+--------------+-------+
I hope the answer is helpful.
Upvotes: 1