Mia P
Mia P

Reputation: 73

Flattening dictionary with pd.json_normalize

I am currently working on flattening this dictionary file and have reached a number of road blocks. I am trying to use json_normalize to flatten this data. If I test with individual instances it works but if I want to flatten all the data it will return an error stating key error '0' I'm not sure how to fix this.

example of the data-

data = {1:{
      'Name': "Thrilling Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['123445452543'],
            "BookID": ['543533254353'],
            "SalesID": ['543267765345']},
     2:{
      'Name': "boring Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['111111', '1121111'],
            "BookID": ['543533254353', '4324232342'],
            "SalesID": ['543267765345', '4353543']}}

my code

d_flat = pd.io.json.json_normalize(data, meta=['Title', 'StoreID', 'BookID', 'SalesID'])

Upvotes: 2

Views: 2368

Answers (2)

Trenton McKinney
Trenton McKinney

Reputation: 62383

  1. Use pandas.DataFrame.from_dict to read data
  2. Convert the values in the 'IDs' column to separate columns
    • .pop removes the old column from df
    • pd.DataFrame(df.pop('IDs').values.tolist()) converts each dict key to a separate column
    • .join the new columns back to df
  3. pd.Series.explode each list in the columns, with .apply.
  • Depending on the data, sometimes the solution is to reshape the data, as shown by piRSquared
import pandas as pd

# test data
data =\
{1: {'IDs': {'BookID': ['543533254353'],
             'SalesID': ['543267765345'],
             'StoreID': ['123445452543']},
     'Name': 'Thrilling Tales of Dragon Slayers'},
 2: {'IDs': {'BookID': ['543533254353', '4324232342'],
             'SalesID': ['543267765345', '4353543'],
             'StoreID': ['111111', '1121111']},
     'Name': 'boring Tales of Dragon Slayers'}}

# load the data using from_dict
df = pd.DataFrame.from_dict(data, orient='index').reset_index(drop=True)

# convert IDs to separate columns
df = df.join(pd.DataFrame(df.pop('IDs').values.tolist()))

# explode the list in each column
df = df.apply(pd.Series.explode).reset_index(drop=True)

# display(df)
                                Name        BookID       SalesID       StoreID
0  Thrilling Tales of Dragon Slayers  543533254353  543267765345  123445452543
1     boring Tales of Dragon Slayers  543533254353  543267765345        111111
2     boring Tales of Dragon Slayers    4324232342       4353543       1121111

Upvotes: 4

piRSquared
piRSquared

Reputation: 294218

Setup

Your data is structured inconveniently. I want to focus on:

  1. Getting the lists in 'IDs' into a list of dictionaries, which would be far more convenient.
  2. Getting rid of the useless keys in the parent dictionary. All we care about are the values.

Your data:

{1: {'Name': 'Thrilling Tales of Dragon Slayers',
     'IDs': {'StoreID': ['123445452543'],
             'BookID': ['543533254353'],
             'SalesID': ['543267765345']}},
 2: {'Name': 'boring Tales of Dragon Slayers',
     'IDs': {'StoreID': ['111111', '1121111'],
             'BookID': ['543533254353', '4324232342'],
             'SalesID': ['543267765345', '4353543']}}}

What I want it to look like:

[{'Name': 'Thrilling Tales of Dragon Slayers',
  'IDs': [{'StoreID': '123445452543',
           'BookID': '543533254353',
           'SalesID': '543267765345'}]},
 {'Name': 'boring Tales of Dragon Slayers',
  'IDs': [{'StoreID': '111111',
           'BookID': '543533254353',
           'SalesID': '543267765345'},
          {'StoreID': '1121111',
           'BookID': '4324232342',
           'SalesID': '4353543'}]}]

Restructure Data

Reasonable Way

Simple loop, don't mess around. This gets us what I showed above

new = []

for v in data.values():
    temp = {**v}           # This is intended to keep all the other data that might be there
    ids = temp.pop('IDs')  # I have to focus on this to create the records
    temp['IDs'] = [dict(zip(ids, x)) for x in zip(*ids.values())]
    new.append(temp)

Cute one-liner

new = [{**v, 'IDs': [dict(zip(v['IDs'], x)) for x in zip(*v['IDs'].values())]} for v in data.values()]

Create DataFrame with pd.json_normalize

In this call to json_normalize we need to specify the path to the records, i.e. the list of id dictionaries found at the 'IDs' key. json_normalize will create one row in the dataframe for every item in that list. This will be done with the the record_path parameter and we pass a tuple that describes the path (if it were in a deeper structure) or a string (if the key is at the top layer, which for us, it is).

record_path = 'IDs'

Then we want to tell json_normalize what keys are metadata for the records. If there are more than one record, as we have, then the metadata will be repeated for each record.

meta = 'Name'

So the final solution looks like this:

pd.json_normalize(new, record_path='IDs', meta='Name')

        StoreID        BookID       SalesID                               Name
0  123445452543  543533254353  543267765345  Thrilling Tales of Dragon Slayers
1        111111  543533254353  543267765345     boring Tales of Dragon Slayers
2       1121111    4324232342       4353543     boring Tales of Dragon Slayers

However

If we are restructuring anyway, might as well make it so we can just pass it to the dataframe constructor.

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543

Bonus Content

While we are at it. The data is ambiguous in regards to whether or not each id type has the same number of ids. Suppose they did not.

data = {1:{
      'Name': "Thrilling Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['123445452543'],
            "BookID": ['543533254353'],
            "SalesID": ['543267765345']}},
     2:{
      'Name': "boring Tales of Dragon Slayers",
      'IDs':{
            "StoreID": ['111111', '1121111'],
            "BookID": ['543533254353', '4324232342'],
            "SalesID": ['543267765345', '4353543', 'extra id']}}}

Then we can use zip_longest from itertools

from itertools import zip_longest

pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip_longest(*r['IDs'].values())
])

                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543
3     boring Tales of Dragon Slayers          None          None      extra id

Upvotes: 3

Related Questions