Sonya
Sonya

Reputation: 167

read in nested columns from json file into pandas df python

I need to read in a JSON file into pandas df. JSON data looks like this:

{"f0_":{"id":"138307057680","ActionName":"Complete","Time":"2020-04-23-12:40:04"}}
{"f0_":{"id":"138313115245","ActionName":"Midpoint","Time":"2020-06-16-20:41:16"}}

I need to get rid of the first key that holds all columns in it. I tried:

import json 
import pandas as pd 
from pandas.io.json import json_normalize

data_pd = pd.read_json('db/my_file.json', lines=True) 
new_data = json_normalize(data_pd)

The error message is: AttributeError: 'str' object has no attribute 'values' The desired output is:

id                ActionName       Time 
138307057680       Complete       2020-04-23-12:40:04
138313115245       Midpoint       2020-06-16-20:41:16

Currently if I print out data_pd, the output is: enter image description here

Upvotes: 1

Views: 55

Answers (2)

Chiheb Nexus
Chiheb Nexus

Reputation: 9267

You can clean the data passed to Pandas before generating your dataframe like this example:

import json
import pandas as pd


def gen_data(file_path):
    with open(file_path) as f:
        for line in f.readlines():
            if line:
                line = json.loads(line)
                for value in line.values():
                    yield value


df = pd.DataFrame(gen_data('db/my_file.json'))
print(df)

Output:

             id ActionName                 Time
0  138307057680   Complete  2020-04-23-12:40:04
1  138313115245   Midpoint  2020-06-16-20:41:16

Bonus:

A bit of speed comparaison (I'm using an i7):

If you Clean your data first then generate your DF:

>> %timeit pd.DataFrame(gen_data('db/my_file.json'))
519 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And if you generate your DF then clean it:

import pandas as pd


def gen_df_method2(file_path):
    data_pd = pd.read_json(file_path, lines=True)
    return pd.DataFrame(data_pd['f0_'].values.tolist())


>> %timeit gen_df_method2('db/my_file.json')
2.66 ms ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 1

Quang Hoang
Quang Hoang

Reputation: 150815

You can try with:

new_data = pd.DataFrame(data_pd['f0_'].values.tolist())

Output:

             id ActionName                 Time
0  138307057680   Complete  2020-04-23-12:40:04
1  138313115245   Midpoint  2020-06-16-20:41:16

Upvotes: 1

Related Questions