Anonymous
Anonymous

Reputation: 467

Is there a pandas function to unpack a column that's a dict into a long format?

So the data has the column [outbreakMap] as a pandas series. I recreated a snippet of the data as a df below:

Initial Data: enter image description here

data = {0: {'id': '53545',
  'latitude': -1.5168500000000003,
  'longitude': 105.2985,
  'outbreakMap': {'ob_102795': {'nationalObReference': '403/KPTS/05/22',
    'oieReference': 'ob_102795',
    'outbreakStartDate': '2022-04-12'},
   'ob_102796': {'nationalObReference': '404/KPTS/05/22',
    'oieReference': 'ob_102796',
    'outbreakStartDate': '2022-04-22'}}},
 1: {'id': '53709',
  'latitude': 36.42056,
  'longitude': 139.9085,
  'outbreakMap': {'ob_101531': {'oieReference': 'ob_101531',
    'outbreakStartDate': '2022-04-12'},
   'ob_101644': {'oieReference': 'ob_101644',
    'outbreakStartDate': '2022-04-14'},
   'ob_102290': {'oieReference': 'ob_102290',
    'outbreakStartDate': '2022-04-21'},
   'ob_100121': {'oieReference': 'ob_100121',
    'outbreakStartDate': '2022-03-24'},
   'ob_102949': {'oieReference': 'ob_102949',
    'outbreakStartDate': '2022-05-09'}}}}

import pandas as pd
df = pd.DataFrame.from_dict(data, orient='index')

Goal: Unpack the dictionary into a long format like-so:

enter image description here

Thank you.

Upvotes: 1

Views: 320

Answers (4)

David Erickson
David Erickson

Reputation: 16683

You can use list comprehension:

lst_3d = [
[
    [data[n]['id'], 
     data[n]['latitude'], 
     data[n]['longitude'], 
     data[n]['outbreakMap'][i]['oieReference'],
     data[n]['outbreakMap'][i]['outbreakStartDate']
] 
    for i in data[n]['outbreakMap']]
for n in data]

#convert to 2d list
lst = [elem for subl in lst_3d for elem in subl]
df = pd.DataFrame(lst, columns=['id', 'latitude', 'longitude', 'oieReference', 'outbreakStartDate'])
df
Out[1]: 
      id  latitude  longitude oieReference outbreakStartDate
0  53545  -1.51685   105.2985    ob_102795        2022-04-12
1  53545  -1.51685   105.2985    ob_102796        2022-04-22
2  53709  36.42056   139.9085    ob_101531        2022-04-12
3  53709  36.42056   139.9085    ob_101644        2022-04-14
4  53709  36.42056   139.9085    ob_102290        2022-04-21
5  53709  36.42056   139.9085    ob_100121        2022-03-24
6  53709  36.42056   139.9085    ob_102949        2022-05-09

Upvotes: 1

tdelaney
tdelaney

Reputation: 77367

Since some of the data is duplicated across multiple rows, I think its easier to write a python table and then convert that to a dataframe. Dataframes need to know the full number of rows when they are first created - so pandas would need to make an intermediate list anyway.

Its just a nested for loop that puts rows into a master list, duplicating the data from the outer loop in the inner loop. The data is pulled from the dictionaries via column name lists so we get the right order for each row.

I noticed that outbreakMap appears to be duplicated so skipped it and nationalObReference isn't in all of the data. I also put the code into a function which is an easy way to discard intermediate data that isn't needed after the calculation is done.

data = {0: {'id': '53545',
  'latitude': -1.5168500000000003,
  'longitude': 105.2985,
  'outbreakMap': {'ob_102795': {'nationalObReference': '403/KPTS/05/22',
    'oieReference': 'ob_102795',
    'outbreakStartDate': '2022-04-12'},
   'ob_102796': {'nationalObReference': '404/KPTS/05/22',
    'oieReference': 'ob_102796',
    'outbreakStartDate': '2022-04-22'}}},
 1: {'id': '53709',
  'latitude': 36.42056,
  'longitude': 139.9085,
  'outbreakMap': {'ob_101531': {'oieReference': 'ob_101531',
    'outbreakStartDate': '2022-04-12'},
   'ob_101644': {'oieReference': 'ob_101644',
    'outbreakStartDate': '2022-04-14'},
   'ob_102290': {'oieReference': 'ob_102290',
    'outbreakStartDate': '2022-04-21'},
   'ob_100121': {'oieReference': 'ob_100121',
    'outbreakStartDate': '2022-03-24'},
   'ob_102949': {'oieReference': 'ob_102949',
    'outbreakStartDate': '2022-05-09'}}}}

import pandas as pd

def data_to_df(data):
    common_cols = ["id", "latitude", "longitude"]
    outbreak_cols = ["nationalObReference", "oieReference", "outbreakStartDate"]
    table = []
    for index, common in data.items():
        common_row = [common[name] for name in common_cols]
        for ob, ref in common["outbreakMap"].items():
            row = common_row + [ref.get(name, None) for name in outbreak_cols]
            table.append(row)
    df = pd.DataFrame(table, columns=common_cols+outbreak_cols)
    return df

df = data_to_df(data)

Upvotes: 1

user7864386
user7864386

Reputation:

While you can build the DataFrame and work on outbreakMap column afterwards, it might be easier and more efficient if you modify data to include the desired keys to the inner dicts, then construct the DataFrame and explode the added columns:

for v in data.values():
    outbreakMap = v.pop('outbreakMap')
    v['outbreak'] = list(outbreakMap)
    v['outbreak_information'] = list(outbreakMap.values())
    
out = pd.DataFrame.from_dict(data, orient='index').explode(['outbreak','outbreak_information'])

Output:

      id  latitude  longitude   outbreak                               outbreak_information
0  53545  -1.51685   105.2985  ob_102795  {'nationalObReference': '403/KPTS/05/22', 'oie...
0  53545  -1.51685   105.2985  ob_102796  {'nationalObReference': '404/KPTS/05/22', 'oie...
1  53709  36.42056   139.9085  ob_101531  {'oieReference': 'ob_101531', 'outbreakStartDa...
1  53709  36.42056   139.9085  ob_101644  {'oieReference': 'ob_101644', 'outbreakStartDa...
1  53709  36.42056   139.9085  ob_102290  {'oieReference': 'ob_102290', 'outbreakStartDa...
1  53709  36.42056   139.9085  ob_100121  {'oieReference': 'ob_100121', 'outbreakStartDa...
1  53709  36.42056   139.9085  ob_102949  {'oieReference': 'ob_102949', 'outbreakStartDa...

Upvotes: 1

Jeong Kim
Jeong Kim

Reputation: 526

[EDIT]

I think this is what you want.

def into_long(outbreakMap):
    '''
    Extracting key, oieReference, and outbreakStartDate
    from outbreakMap
    '''
    keys = outbreakMap.keys()
    # return list of tuples
    return [(
        key,
        outbreakMap[key]['oieReference'],
        outbreakMap[key]['outbreakStartDate']
    ) for key in keys]

df['outbreakMap_tuple'] = df['outbreakMap'].apply(lambda x: into_long(x))

# list to multiple rows
# further reading: https://stackoverflow.com/questions/39954668/how-to-convert-column-with-list-of-values-into-rows-in-pandas-dataframe
df = df.explode('outbreakMap_tuple')

# tuples to multiple columns
# further reading: https://stackoverflow.com/questions/29550414/how-can-i-split-a-column-of-tuples-in-a-pandas-dataframe
df[['outbreakMap', 'oieReference', 'outbreakStartDate']] = pd.DataFrame(df['outbreakMap_tuple'].tolist(), index=df.index)

df.drop('outbreakMap_tuple', axis=1, inplace=True)

# this is optional
df.reset_index(drop=True, inplace=True)

The outcome looks like

enter image description here

I am sure that there are better ways. This is just one of them.

Upvotes: 2

Related Questions