Maher Said

Reputation: 170

Converting nested JSON file with duplicate keys to dataframe in Python

Assume the following snip of a JSON file to be flattened on Python.

{
  "locations" : [ {
    "timestampMs" : "1549913792265",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549913286057",
      "activity" : [ {
        "type" : "STILL",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549913730454",
      "activity" : [ {
        "type" : "DRIVING",
        "confidence" : 100
      } ]
    } ]
  }, {
    "timestampMs" : "1549912693813",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549911547308",
      "activity" : [ {
        "type" : "ACTIVE",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549912330473",
      "activity" : [ {
        "type" : "BIKING",
        "confidence" : 100
      } ]
    } ]
  } ]
}

The goal is to turn it into a flattened dataframe like this:

location_id timestampMs     ... verticalAccuracy activity_timestampMs activity_activity_type ...
1           1549913792265       2                1549913286057        "STILL"
1           1549913792265       2                1549913730454        "DRIVING"
etc.

How would one do so given that the key 'activity' is repeated at different nest levels?

Upvotes: 2

Views: 1860

Answers (1)

Peter Leimbigler

Reputation: 11105

Here is a solution using json_normalize (documentation), assuming the JSON snippet you posted is in a Python dictionary named d.
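If the data is still sitting in a JSON file on disk, one straightforward way to get such a dictionary is with the standard json module (the file name below is just a placeholder):

import json

# Load the JSON file into a plain Python dict named d
# ('location_history.json' is a placeholder path)
with open('location_history.json') as f:
    d = json.load(f)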

import pandas as pd
from pandas.io.json import json_normalize
# (in pandas 1.0+, this function is also available as pd.json_normalize)

# Build a list of paths to JSON fields that will end up as metadata
# in the final DataFrame
meta = list(d['locations'][0].keys())

# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# 'activity']

# Almost correct. 'activity' is the record path itself, so remove it,
# and keep the outer activity timestamp as metadata by appending
# the nested path ['activity', 'timestampMs'].
meta.remove('activity')
meta.append(['activity', 'timestampMs'])

# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# ['activity', 'timestampMs']]

# Use json_normalize on the list of dicts
# that lives at d['locations'], passing in
# the appropriate record path and metadata
# paths, and specifying the double 'activity_'
# record prefix.
json_normalize(d['locations'], 
               record_path=['activity', 'activity'], 
               meta=meta,
               record_prefix='activity_activity_')

   activity_activity_confidence activity_activity_type    timestampMs  latitudeE7  longitudeE7  accuracy  altitude  verticalAccuracy activity.timestampMs
0                           100                  STILL  1549913792265   323518421   -546166813        13         1                 2        1549913286057
1                           100                DRIVING  1549913792265   323518421   -546166813        13         1                 2        1549913730454
2                           100                 ACTIVE  1549912693813   323518421   -546166813        13         1                 2        1549911547308
3                           100                 BIKING  1549912693813   323518421   -546166813        13         1                 2        1549912330473
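If you also want the meta column to match the activity_timestampMs naming from the question rather than the dotted activity.timestampMs, a plain rename on the result should do it:

df = json_normalize(d['locations'],
                    record_path=['activity', 'activity'],
                    meta=meta,
                    record_prefix='activity_activity_')

# Rename the dotted meta column to the underscore style used in the question
df = df.rename(columns={'activity.timestampMs': 'activity_timestampMs'})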

Edit

If the ['activity', 'activity'] record path is sometimes missing, the above code will throw an error. The following workaround should work for this specific case, but is brittle and might be unacceptably slow depending on the size of your input data:

# Create an example by deleting one of the 'activity' paths 
# from the original dict
del d['locations'][0]['activity']

pd.concat([json_normalize(x, 
                          record_path=['activity', 'activity'] 
                                      if 'activity' in x.keys() else None, 
                          meta=meta, 
                          record_prefix='activity_activity_') 
           for x in d['locations']], 
          axis=0, 
          ignore_index=True,
          sort=False)

   accuracy  altitude  latitudeE7  longitudeE7    timestampMs  verticalAccuracy  activity_activity_confidence activity_activity_type activity.timestampMs
0        13         1   323518421   -546166813  1549913792265                 2                           NaN                    NaN                  NaN
1        13         1   323518421   -546166813  1549912693813                 2                         100.0                 ACTIVE        1549911547308
2        13         1   323518421   -546166813  1549912693813                 2                         100.0                 BIKING        1549912330473

Upvotes: 1
