Flattening dictionary with pd.json_normalize

Question

I am currently working on flattening this dictionary file and have reached a number of road blocks. I am trying to use json_normalize to flatten this data. If I test with individual instances it works but if I want to flatten all the data it will return an error stating key error '0' I'm not sure how to fix this.

example of the data-

data = {1:{
      'Name': "Thrilling Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['123445452543'],
            "BookID": ['543533254353'],
            "SalesID": ['543267765345']},
     2:{
      'Name': "boring Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['111111', '1121111'],
            "BookID": ['543533254353', '4324232342'],
            "SalesID": ['543267765345', '4353543']}}

my code

d_flat = pd.io.json.json_normalize(data, meta=['Title', 'StoreID', 'BookID', 'SalesID'])

piRSquared · Accepted Answer

Setup

Your data is structured inconveniently. I want to focus on:

Getting the lists in 'IDs' into a list of dictionaries, which would be far more convenient.
Getting rid of the useless keys in the parent dictionary. All we care about are the values.

Your data:

{1: {'Name': 'Thrilling Tales of Dragon Slayers',
     'IDs': {'StoreID': ['123445452543'],
             'BookID': ['543533254353'],
             'SalesID': ['543267765345']}},
 2: {'Name': 'boring Tales of Dragon Slayers',
     'IDs': {'StoreID': ['111111', '1121111'],
             'BookID': ['543533254353', '4324232342'],
             'SalesID': ['543267765345', '4353543']}}}

What I want it to look like:

[{'Name': 'Thrilling Tales of Dragon Slayers',
  'IDs': [{'StoreID': '123445452543',
           'BookID': '543533254353',
           'SalesID': '543267765345'}]},
 {'Name': 'boring Tales of Dragon Slayers',
  'IDs': [{'StoreID': '111111',
           'BookID': '543533254353',
           'SalesID': '543267765345'},
          {'StoreID': '1121111',
           'BookID': '4324232342',
           'SalesID': '4353543'}]}]

Restructure Data

Reasonable Way

Simple loop, don't mess around. This gets us what I showed above

new = []

for v in data.values():
    temp = {**v}           # This is intended to keep all the other data that might be there
    ids = temp.pop('IDs')  # I have to focus on this to create the records
    temp['IDs'] = [dict(zip(ids, x)) for x in zip(*ids.values())]
    new.append(temp)

Cute one-liner

new = [{**v, 'IDs': [dict(zip(v['IDs'], x)) for x in zip(*v['IDs'].values())]} for v in data.values()]

Create `DataFrame` with `pd.json_normalize`

In this call to json_normalize we need to specify the path to the records, i.e. the list of id dictionaries found at the 'IDs' key. json_normalize will create one row in the dataframe for every item in that list. This will be done with the the record_path parameter and we pass a tuple that describes the path (if it were in a deeper structure) or a string (if the key is at the top layer, which for us, it is).

record_path = 'IDs'

Then we want to tell json_normalize what keys are metadata for the records. If there are more than one record, as we have, then the metadata will be repeated for each record.

meta = 'Name'

So the final solution looks like this:

pd.json_normalize(new, record_path='IDs', meta='Name')

        StoreID        BookID       SalesID                               Name
0  123445452543  543533254353  543267765345  Thrilling Tales of Dragon Slayers
1        111111  543533254353  543267765345     boring Tales of Dragon Slayers
2       1121111    4324232342       4353543     boring Tales of Dragon Slayers

However

If we are restructuring anyway, might as well make it so we can just pass it to the dataframe constructor.

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543

Bonus Content

While we are at it. The data is ambiguous in regards to whether or not each id type has the same number of ids. Suppose they did not.

data = {1:{
      'Name': "Thrilling Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['123445452543'],
            "BookID": ['543533254353'],
            "SalesID": ['543267765345']}},
     2:{
      'Name': "boring Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['111111', '1121111'],
            "BookID": ['543533254353', '4324232342'],
            "SalesID": ['543267765345', '4353543', 'extra id']}}}

Then we can use zip_longest from itertools

from itertools import zip_longest

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip_longest(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543
3     boring Tales of Dragon Slayers          None          None      extra id

Flattening dictionary with pd.json_normalize

Answers (2)

Setup

Restructure Data

Reasonable Way

Cute one-liner

Create `DataFrame` with `pd.json_normalize`

However

Bonus Content

Related Questions

Flattening dictionary with pd.json_normalize

Answers (2)

Setup

Restructure Data

Reasonable Way

Cute one-liner

Create DataFrame with pd.json_normalize

However

Bonus Content

Related Questions

Create `DataFrame` with `pd.json_normalize`