gdm
gdm

Reputation: 7978

Pandas Read Nested Json Data

I have this JSON string:

{
    "2022-09-28T00:45:00.000Z": [
        {
            "value": 0.216,
            "plantId": "27050937",
            "man": "pippo"
        },
        {
            "value": 0.224,
            "plantId": "30082443",
            "man": "pippo"
        }
    ],
    "2022-09-28T00:30:00.000Z": [
        {
            "value": 0.248,
            "plantId": "27050937",
            "man": "pippo"
        },
        {
            "value": 0.108,
            "plantId": "30082443",
            "man": "pippo"
        }
    ]
}

I would like to import the data in order to build a numpy matrix indexed as (plantId, time). For example, I would get a matrix like:

  300082443 0.216 0.108
  27050937  0.224 0.248
  ....

I tried with pandas:

a =pd.read_json("./data.json",orient='index')

But I get:

2022-09-28 00:45:00+00:00  {'value': 0.216, 'plantId': '27050937', 'man...  {'value': 0.224, 'plantId': '30082443', 'man...
2022-09-28 00:30:00+00:00  {'value': 0.248, 'plantId': '27050937', 'man...  {'value': 0.108, 'plantId': '30082443', 'man...

Upvotes: 1

Views: 146

Answers (2)

jezrael
jezrael

Reputation: 863631

Use list comprehension for merge dates (outer keys) to nested dictionaries and pass to DataFrame constructor:

import json

with open('./data.json') as data_file:    
    data = json.load(data_file)  


df = pd.DataFrame([{**{'date': k}, **x} for k, v in data.items() for x in v])
print (df)
                       date  value   plantId    man
0  2022-09-28T00:45:00.000Z  0.216  27050937  pippo
1  2022-09-28T00:45:00.000Z  0.224  30082443  pippo
2  2022-09-28T00:30:00.000Z  0.248  27050937  pippo
3  2022-09-28T00:30:00.000Z  0.108  30082443  pippo

EDIT: Next step if need reshape DataFrame by plantId and value columns by GroupBy.cumcount and DataFrame.pivot:

df1 = (df.assign(g = df.groupby('plantId').cumcount().add(1))
         .pivot('plantId','g','value')
         .add_prefix('value'))
print (df1)
g         value1  value2
plantId                 
27050937   0.216   0.248
30082443   0.224   0.108

Or if need pivoting by date in columns:

df2 = df.pivot('plantId','date','value')
#df2 = df.pivot_table(index='plantId',columns='date',values='value', aggfunc='mean')
print (df2)
date      2022-09-28T00:30:00.000Z  2022-09-28T00:45:00.000Z
plantId                                                     
27050937                     0.248                     0.216
30082443                     0.108                     0.224

Upvotes: 2

Bushmaster
Bushmaster

Reputation: 4608

@jezrael's answer is much shorter and more efficient.

Combining the two answers:

df = pd.DataFrame([{**{'date': k}, **x} for k, v in a.items() for x in v]) #jezrael's answer

df2=df.groupby('plantId')['value'].agg(list).reset_index()
df2[['value_1','value_2']] = pd.DataFrame(df2.value.tolist(), index= df2.index)
df2
'''
plantId           value  value_1  value_2
0  27050937  [0.216, 0.248]    0.216    0.248
1  30082443  [0.224, 0.108]    0.224    0.108
'''

details (long solution):

a={
    "2022-09-28T00:45:00.000Z": [
        {
            "value": 0.216,
            "plantId": "27050937",
            "man": "pippo"
        },
        {
            "value": 0.224,
            "plantId": "30082443",
            "man": "pippo"
        }
    ],
    "2022-09-28T00:30:00.000Z": [
        {
            "value": 0.248,
            "plantId": "27050937",
            "man": "pippo"
        },
        {
            "value": 0.108,
            "plantId": "30082443",
            "man": "pippo"
        }
    ]
}

df=pd.DataFrame(a)

df: df

    2022-09-28T00:45:00.000Z                                2022-09-28T00:30:00.000Z
0   {'value': 0.216, 'plantId': '27050937', 'man': 'pippo'} {'value': 0.248, 'plantId': '27050937', 'man': 'pippo'}
1   {'value': 0.224, 'plantId': '30082443', 'man': 'pippo'} {'value': 0.108, 'plantId': '30082443', 'man': 'pippo'}

As the data increases, the number of columns will also increase. Now we need to convert these columns to rows.

df = df.T.reset_index() # T = transpose()

We used the transpose function to convert columns to rows. latest version: latest version

        index                       0                                       1
0       2022-09-28T00:45:00.000Z    {'value': 0.216, 'plantId': '27050937', 'man': 'pippo'} {'value': 0.224, 'plantId': '30082443', 'man': 'pippo'}
1       2022-09-28T00:30:00.000Z    {'value': 0.248, 'plantId': '27050937', 'man': 'pippo'} {'value': 0.108, 'plantId': '30082443', 'man': 'pippo'}

now let's parse the dictonary objects (columns 0 and 1) in the rows into new columns. We will use the json_normalize function for this.

first= pd.json_normalize(df.pop(0))
df=df.join(first)
second=pd.json_normalize(df.pop(1))
df = pd.concat([df,second])
print(df)
'''
                      index  value   plantId    man
0  2022-09-28T00:45:00.000Z  0.216  27050937  pippo
1  2022-09-28T00:30:00.000Z  0.248  27050937  pippo
0                       NaN  0.224  30082443  pippo
1                       NaN  0.108  30082443  pippo

'''
#if you want to fill NaNs use:
df = df.fillna({'index':df[['index']][df['index'].notnull()].squeeze()})
#fill nans with same index number.
'''
                      index  value   plantId    man
0  2022-09-28T00:45:00.000Z  0.216  27050937  pippo
1  2022-09-28T00:30:00.000Z  0.248  27050937  pippo
0  2022-09-28T00:45:00.000Z  0.224  30082443  pippo
1  2022-09-28T00:30:00.000Z  0.108  30082443  pippo
'''


json_normalize returns only the processed column values. But we also need other columns so we use join function. Then we should do the same operation in the column named 1. but here I am using concat instead of join because the column names are the same and I want the data to be inserted one below the other (as rows not as columns). finally group by plant_id and add the values ​​to the list then split the list items into new columns.

df2=df.groupby('plantId')['value'].agg(list).reset_index()
df2[['value_1','value_2']] = pd.DataFrame(df2.value.tolist(), index= df2.index)
print(df2)
'''
    plantId           value  value_1  value_2
0  27050937  [0.216, 0.248]    0.216    0.248
1  30082443  [0.224, 0.108]    0.224    0.108
'''

Upvotes: 1

Related Questions