RData
RData

Reputation: 969

Read complex JSON to extract key values

I have a JSON and I'm trying to read part of it to extract keys and values. Assuming response is my JSON data, here is my code:

data_dump = json.dumps(response) 
data = json.loads(data_dump)

Here my data object becomes a list and I'm trying to get the keys as below

id = [key for key in data.keys()]

This fails with the error:

A list object does not have an attribute keys**. How can I get over this to get my below output?

Here is my JSON:

   {
"1": {
    "task": [
        "wakeup",
        "getready"
    ]
},
"2": {
    "task": [
        "brush",
        "shower"
    ]
},
"3": {
    "task": [
        "brush",
        "shower"
    ]
},
"activites": ["standup", "play", "sitdown"],
"statuscheck": {
    "time": 60,
    "color": 1002,
    "change(me)": 9898
},
"action": ["1", "2", "3", "4"]
}

The output I need is as below. I do not need data from the rest of JSON.

id task
1 wakeup, getready
2 brush , shower

Upvotes: 1

Views: 426

Answers (1)

ZygD
ZygD

Reputation: 24488

If you know that the keys you need are "1" and "2", you could try reading the JSON string as a dataframe, unpivoting it, exploding and grouping:

from pyspark.sql import functions as F

df = (spark.read.json(sc.parallelize([data_dump]))
    .selectExpr("stack(2, '1', `1`, '2', `2`) (id, task)")
    .withColumn('task', F.explode('task.task'))
    .groupBy('id').agg(F.collect_list('task').alias('task'))
)
df.show()
# +---+------------------+
# | id|              task|
# +---+------------------+
# |  1|[wakeup, getready]|
# |  2|   [brush, shower]|
# +---+------------------+

However, it may be easier to deal with it in Python:

data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items() if k in ['1', '2']]
df = spark.createDataFrame(data2, ['id', 'task'])

df.show()
# +---+------------------+
# | id|              task|
# +---+------------------+
# |  1|[wakeup, getready]|
# |  2|   [brush, shower]|
# +---+------------------+

Upvotes: 1

Related Questions