Reputation: 1373
I have a list:
data_list= [
{
'fields': {
'standard': ['ADSL1'],
'serviceCode': ['BNG_DSL'],
'deltaUpload': [0]
}
},
{
'fields': {
'standard': ['ADSL1'],
'serviceCode': ['BNG_DSL'],
'deltaUpload': [545618]
}
},
{
'fields': {
'standard': ['ADSL1'],
'serviceCode': ['BNG_DSL'],
'deltaUpload': [597561]
}
},
{
'fields': {
'standard': ['ADSL1'],
'serviceCode': ['BNG_DSL'],
'deltaUpload': [323771]
}
},
{
'fields': {
'standard': ['ADSL1'],
'serviceCode': ['BNG_DSL'],
'deltaUpload': [1088]
}
}
]
From this list I have created a dataframe:
From df.fields series I would like to create a new dataframe. I tried as:
Everything is OK, but I don't want each cell to be in the list. I tried:
...which works but is very slow due to tenth of millions of data rows. Is there vectorized way to create fields dataframe out of data_list with no cell lists?
Upvotes: 0
Views: 134
Reputation: 28669
the json_normalize function in pandas is not exceptionally fast. I would suggest you reduce the initial data to a list or dict, then create your dataframe:
d = [(i['fields']['standard'][0],
i['fields']['serviceCode'][0],
i['fields']['deltaUpload'][0])
if i['fields']['standard'] else ''
if i['fields']['serviceCode'] else ''
if i['fields']['deltaUpload'] else ''
for i in data_list ]
d
[('ADSL1', 'BNG_DSL', 0),
('ADSL1', 'BNG_DSL', 1088),
('ADSL1', 'BNG_DSL', 323771),
('ADSL1', 'BNG_DSL', 545618),
('ADSL1', 'BNG_DSL', 597561)]
df = pd.DataFrame(d)
df.columns = ['standard','serviceCode','deltaUpload']
standard serviceCode deltaUpload
0 ADSL1 BNG_DSL 545618
1 ADSL1 BNG_DSL 1088
2 ADSL1 BNG_DSL 323771
3 ADSL1 BNG_DSL 597561
4 ADSL1 BNG_DSL 0
Time to run on my PC : 596 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each).
Time passes and we learn more : shorter code, cleaner :
from collections import defaultdict
d = defaultdict(list)
for entry in data_list:
for k,v in entry['fields'].items():
d[k].append(v[0])
pd.DataFrame(d)
Upvotes: 0
Reputation: 75080
You can use:
df = pd.DataFrame(data_list)
out = pd.DataFrame(df['fields'].tolist()).stack().str[0].unstack()
standard serviceCode deltaUpload
0 ADSL1 BNG_DSL 0
1 ADSL1 BNG_DSL 545618
2 ADSL1 BNG_DSL 597561
3 ADSL1 BNG_DSL 323771
4 ADSL1 BNG_DSL 1088
We can also try inferring the correct dtypes by using infer_objects
out1 = (pd.DataFrame(df['fields'].tolist()).stack().str[0]
.astype(object).unstack().infer_objects())
We also can do pd.io.json.json_normalize
directly:
df = pd.io.json.json_normalize(data_list).rename(columns=lambda x: x.split('.')[1])
df.stack().str[0].astype(object).unstack().infer_objects()
print(out.dtypes)
#standard object
#serviceCode object
#deltaUpload object
#dtype: object
print(out1.dtypes)
#standard object
#serviceCode object
#deltaUpload int64
dtype: object
Upvotes: 1