sharp
sharp

Reputation: 2158

Pandas - create data frame from nested key values and nested list in the dictionary

How to do untangle nested dictionary with list in key/value into columns? I tried different combination to solve converting nested dictionary into pandas data frame. Looked through the stack I am getting close to fix the problem just not quite.

Sample Data:

test = {
    'abc': {
        'company_id': '123c',
        'names': ['Oscar', 'John Smith', 'Smith, John'],
        'education': ['MS', 'BS']
    },
    'DEF': {
        'company_id': '124b',
        'names': ['Matt B.'],
        'education': ['']
    }
}

Tried:

1)

pd.DataFrame(list(test.items())) # not working entirely - creates {dictionary in col '1'}

2)

df = pd.concat({
        k: pd.DataFrame.from_dict(v, 'index') for k, v in test.items()
    }, 
    axis=0)

df2 = df.T
df2.reset_index() # creates multiple columns

Output Needed:

enter image description here

Upvotes: 2

Views: 755

Answers (2)

G.G
G.G

Reputation: 765

pd.DataFrame.from_dict(test, orient='index').explode(['names']).explode(['education']).reset_index().rename(columns={'index':'set_name'})


 set_name company_id        names education
0      abc       123c        Oscar        MS
1      abc       123c        Oscar        BS
2      abc       123c   John Smith        MS
3      abc       123c   John Smith        BS
4      abc       123c  Smith, John        MS
5      abc       123c  Smith, John        BS
6      DEF       124b      Matt B.

Upvotes: 0

sobek
sobek

Reputation: 1426

Update:

With the release of pandas 0.25 and the addition of explode this is now a lot easier:

frame = pd.DataFrame(test).T
frame = frame.explode('names').set_index(
    ['company_id', 'names'],
    append=True).explode(
    'education').reset_index(
    ['company_id', 'names']
)

Pre pandas 0.25:

This is not really lean but then this is a rather complicated transformation. Inspired by this blog post, I solved it using two separate iterations of turning the list column into a series and then transforming the DataFrame using melt.

import pandas as pd

test = {
    'abc': {
        'company_id': '123c',
        'names': ['Oscar', 'John Smith', 'Smith, John'],
        'education': ['MS', 'BS']
    },
    'DEF': {
        'company_id': '124b',
        'names': ['Matt B.'],
        'education': ['']
    }
}

frame = pd.DataFrame(test).T

names = frame.names.apply(pd.Series)
frame = frame.merge(
    names, left_index=True, right_index=True).drop('names', axis=1)
frame = frame.reset_index().melt(
    id_vars=['index', 'company_id', 'education'],
    value_name='names').drop('variable', axis=1).dropna()

education = frame.education.apply(pd.Series)
frame = frame.merge(
    education, left_index=True, right_index=True).drop('education', axis=1)
frame = frame.melt(
    id_vars=['index', 'company_id', 'names'],
    value_name='education').drop(
    'variable', axis=1).dropna().sort_values(by=['company_id', 'names'])

frame.columns = ['set_name', 'company_id', 'names', 'education']

print(frame)

Result:

  set_name company_id        names education
2      abc       123c   John Smith        MS
6      abc       123c   John Smith        BS
0      abc       123c        Oscar        MS
4      abc       123c        Oscar        BS
3      abc       123c  Smith, John        MS
7      abc       123c  Smith, John        BS
1      DEF       124b      Matt B.          

Upvotes: 2

Related Questions