user3225309

Reputation: 1373

Python pandas: how to unpack a column of JSON into a new dataframe in the most efficient way?

I have a list:

data_list = [
    {
        'fields': {
            'standard': ['ADSL1'],
            'serviceCode': ['BNG_DSL'],
            'deltaUpload': [0]
        }
    },
    {
        'fields': {
            'standard': ['ADSL1'],
            'serviceCode': ['BNG_DSL'],
            'deltaUpload': [545618]
        }
    },
    {
        'fields': {
            'standard': ['ADSL1'],
            'serviceCode': ['BNG_DSL'],
            'deltaUpload': [597561]
        }
    },
    {
        'fields': {
            'standard': ['ADSL1'],
            'serviceCode': ['BNG_DSL'],
            'deltaUpload': [323771]
        }
    },
    {
        'fields': {
            'standard': ['ADSL1'],
            'serviceCode': ['BNG_DSL'],
            'deltaUpload': [1088]
        }
    }
]

From this list I have created a dataframe:

(screenshot of the resulting dataframe, not shown)
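Presumably the dataframe was built along these lines, giving a single fields column whose cells are the inner dicts (a sketch, not the original code):

import pandas as pd

df = pd.DataFrame(data_list)
# df has one column, 'fields'; each cell is a dict of one-element lists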

From the df.fields series I would like to create a new dataframe. I tried:

(screenshot of the attempt and its output, not shown)
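A plausible reconstruction of this step (hypothetical, based on the description) is:

fields_df = pd.DataFrame(df['fields'].tolist())
# works, but every cell is still a one-element list, e.g. ['ADSL1'], ['BNG_DSL'], [0]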

Everything is OK, but I don't want each cell to be a list. I tried:

(screenshot of the slow attempt, not shown)
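The exact code is not shown; a typical element-wise version of this step (hypothetical) would be:

fields_df = pd.DataFrame(df['fields'].tolist()).applymap(lambda x: x[0])
# applymap calls the lambda once per cell, which is slow on tens of millions of rows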

...which works but is very slow because there are tens of millions of rows. Is there a vectorized way to create the fields dataframe out of data_list without list-valued cells?

Upvotes: 0

Views: 134

Answers (2)

sammywemmy

Reputation: 28669

The json_normalize function in pandas is not exceptionally fast. I would suggest reducing the initial data to a list of tuples (or a dict) first, then creating your dataframe:

d = [(i['fields']['standard'][0],
      i['fields']['serviceCode'][0],
      i['fields']['deltaUpload'][0])
     # guard against missing/empty lists; fall back to an empty row
     if all(i['fields'].get(k) for k in ('standard', 'serviceCode', 'deltaUpload'))
     else ('', '', '')
     for i in data_list]

d

[('ADSL1', 'BNG_DSL', 0),
 ('ADSL1', 'BNG_DSL', 545618),
 ('ADSL1', 'BNG_DSL', 597561),
 ('ADSL1', 'BNG_DSL', 323771),
 ('ADSL1', 'BNG_DSL', 1088)]
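Building plain Python tuples first keeps the per-row work in cheap pure-Python operations; pandas then creates the whole dataframe in one shot.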

df = pd.DataFrame(d, columns=['standard', 'serviceCode', 'deltaUpload'])


  standard serviceCode  deltaUpload
0    ADSL1     BNG_DSL            0
1    ADSL1     BNG_DSL       545618
2    ADSL1     BNG_DSL       597561
3    ADSL1     BNG_DSL       323771
4    ADSL1     BNG_DSL         1088

Time to run on my PC: 596 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each).

Time passes and we learn more: shorter, cleaner code:

from collections import defaultdict

# build one plain list per column in a single pass
d = defaultdict(list)
for entry in data_list:
    for k, v in entry['fields'].items():
        d[k].append(v[0])

pd.DataFrame(d)
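This builds each column as a plain list in a single pass, so the resulting dataframe already has scalar cells and keeps the original row order.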

Upvotes: 0

anky

Reputation: 75080

You can use:

df = pd.DataFrame(data_list)
out = pd.DataFrame(df['fields'].tolist()).stack().str[0].unstack()

  standard serviceCode deltaUpload
0    ADSL1     BNG_DSL           0
1    ADSL1     BNG_DSL      545618
2    ADSL1     BNG_DSL      597561
3    ADSL1     BNG_DSL      323771
4    ADSL1     BNG_DSL        1088
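Here pd.DataFrame(df['fields'].tolist()) expands the dicts into list-valued columns, stack() reshapes them into one long Series, .str[0] picks the first element of each list, and unstack() pivots back to the original column layout.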

We can also try inferring the correct dtypes by using infer_objects:

out1 = (pd.DataFrame(df['fields'].tolist()).stack().str[0]
        .astype(object).unstack().infer_objects())

We can also apply pd.io.json.json_normalize (available as pd.json_normalize in pandas 1.0+) directly:

df = pd.io.json.json_normalize(data_list).rename(columns=lambda x: x.split('.')[1])
df.stack().str[0].astype(object).unstack().infer_objects()
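The rename strips the fields. prefix that json_normalize adds to nested column names (fields.standard becomes standard).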

print(out.dtypes)
#standard       object
#serviceCode    object
#deltaUpload    object
#dtype: object

print(out1.dtypes)
#standard       object
#serviceCode    object
#deltaUpload     int64
#dtype: object

Upvotes: 1
