Reputation: 170
Assume the following snippet of a JSON file is to be flattened in Python.
{
  "locations" : [ {
    "timestampMs" : "1549913792265",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549913286057",
      "activity" : [ {
        "type" : "STILL",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549913730454",
      "activity" : [ {
        "type" : "DRIVING",
        "confidence" : 100
      } ]
    } ]
  }, {
    "timestampMs" : "1549912693813",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549911547308",
      "activity" : [ {
        "type" : "ACTIVE",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549912330473",
      "activity" : [ {
        "type" : "BIKING",
        "confidence" : 100
      } ]
    } ]
  } ]
}
The goal is to turn it into a flattened dataframe like this:
location_id timestampMs ... verticalAccuracy activity_timestampMs activity_activity_type ...
1 1549913792265 13 1549913286057 "STILL"
1 1549913792265 13 1549913730454 "DRIVING"
etc.
How would one do so, given that the key 'activity' is repeated at different nesting levels?
Upvotes: 2
Views: 1860
Reputation: 11105
Here is a solution using json_normalize (documentation), assuming the JSON snippet you posted is in a Python dictionary named d.
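(Aside: if the data is still sitting in a file rather than a dictionary, one way to get d is the standard-library json module; the filename locations.json below is just a placeholder.)

import json

# Placeholder filename; point this at your actual location-history export
with open('locations.json') as f:
    d = json.load(f)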
import pandas as pd
from pandas.io.json import json_normalize  # in newer pandas: from pandas import json_normalize
# Build a list of paths to JSON fields that will end up as metadata
# in the final DataFrame
meta = list(d['locations'][0].keys())
# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# 'activity']
# Almost correct. We need to remove 'activity' and append
# the list ['activity', 'timestampMs'] to meta.
meta.remove('activity')
meta.append(['activity', 'timestampMs'])
# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# ['activity', 'timestampMs']]
# Use json_normalize on the list of dicts
# that lives at d['locations'], passing in
# the appropriate record path and metadata
# paths, and specifying the double 'activity_'
# record prefix.
json_normalize(d['locations'],
               record_path=['activity', 'activity'],
               meta=meta,
               record_prefix='activity_activity_')
activity_activity_confidence activity_activity_type timestampMs latitudeE7 longitudeE7 accuracy altitude verticalAccuracy activity.timestampMs
0 100 STILL 1549913792265 323518421 -546166813 13 1 2 1549913286057
1 100 DRIVING 1549913792265 323518421 -546166813 13 1 2 1549913730454
2 100 ACTIVE 1549912693813 323518421 -546166813 13 1 2 1549911547308
3 100 BIKING 1549912693813 323518421 -546166813 13 1 2 1549912330473
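Note that the ['activity', 'timestampMs'] meta path comes out as the dotted column activity.timestampMs. If you want the underscore style from your question (activity_timestampMs), one option is a rename afterwards:

df = json_normalize(d['locations'],
                    record_path=['activity', 'activity'],
                    meta=meta,
                    record_prefix='activity_activity_')
# Rename the dotted meta column to match the naming in the question
df = df.rename(columns={'activity.timestampMs': 'activity_timestampMs'})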
If the ['activity', 'activity'] record path is sometimes missing, the above code will raise an error. The following workaround should work for this specific case, but it is brittle and might be unacceptably slow, depending on the size of your input data:
# Create an example by deleting one of the 'activity' paths
# from the original dict
del d['locations'][0]['activity']
pd.concat([json_normalize(x,
                          record_path=(['activity', 'activity']
                                       if 'activity' in x else None),
                          meta=meta,
                          record_prefix='activity_activity_')
           for x in d['locations']],
          axis=0,
          ignore_index=True,
          sort=False)
accuracy altitude latitudeE7 longitudeE7 timestampMs verticalAccuracy activity_activity_confidence activity_activity_type activity.timestampMs
0 13 1 323518421 -546166813 1549913792265 2 NaN NaN NaN
1 13 1 323518421 -546166813 1549912693813 2 100.0 ACTIVE 1549911547308
2 13 1 323518421 -546166813 1549912693813 2 100.0 BIKING 1549912330473
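If the per-record concat turns out to be too slow, here is a sketch of a faster variant (reusing meta and the imports from above, and assuming only the top-level 'activity' key can be missing): partition the records and call json_normalize once per group instead of once per record.

# Partition locations by whether the 'activity' record path exists
with_activity = [x for x in d['locations'] if 'activity' in x]
without_activity = [x for x in d['locations'] if 'activity' not in x]

parts = []
if with_activity:
    parts.append(json_normalize(with_activity,
                                record_path=['activity', 'activity'],
                                meta=meta,
                                record_prefix='activity_activity_'))
if without_activity:
    # No record path here; these rows get NaN in the activity columns
    parts.append(json_normalize(without_activity))

pd.concat(parts, axis=0, ignore_index=True, sort=False)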
Upvotes: 1