Reputation: 467
So the data has the column [outbreakMap] as a pandas series. I recreated a snippet of the data as a df below:
data = {0: {'id': '53545',
'latitude': -1.5168500000000003,
'longitude': 105.2985,
'outbreakMap': {'ob_102795': {'nationalObReference': '403/KPTS/05/22',
'oieReference': 'ob_102795',
'outbreakStartDate': '2022-04-12'},
'ob_102796': {'nationalObReference': '404/KPTS/05/22',
'oieReference': 'ob_102796',
'outbreakStartDate': '2022-04-22'}}},
1: {'id': '53709',
'latitude': 36.42056,
'longitude': 139.9085,
'outbreakMap': {'ob_101531': {'oieReference': 'ob_101531',
'outbreakStartDate': '2022-04-12'},
'ob_101644': {'oieReference': 'ob_101644',
'outbreakStartDate': '2022-04-14'},
'ob_102290': {'oieReference': 'ob_102290',
'outbreakStartDate': '2022-04-21'},
'ob_100121': {'oieReference': 'ob_100121',
'outbreakStartDate': '2022-03-24'},
'ob_102949': {'oieReference': 'ob_102949',
'outbreakStartDate': '2022-05-09'}}}}
import pandas as pd
df = pd.DataFrame.from_dict(data, orient='index')
Goal: Unpack the dictionary into a long format like-so:
Thank you.
Upvotes: 1
Views: 320
Reputation: 16683
You can use list comprehension:
lst_3d = [
[
[data[n]['id'],
data[n]['latitude'],
data[n]['longitude'],
data[n]['outbreakMap'][i]['oieReference'],
data[n]['outbreakMap'][i]['outbreakStartDate']
]
for i in data[n]['outbreakMap']]
for n in data]
#convert to 2d list
lst = [elem for subl in lst_3d for elem in subl]
df = pd.DataFrame(lst, columns=['id', 'latitude', 'longitude', 'oieReference', 'outbreakStartDate'])
df
Out[1]:
id latitude longitude oieReference outbreakStartDate
0 53545 -1.51685 105.2985 ob_102795 2022-04-12
1 53545 -1.51685 105.2985 ob_102796 2022-04-22
2 53709 36.42056 139.9085 ob_101531 2022-04-12
3 53709 36.42056 139.9085 ob_101644 2022-04-14
4 53709 36.42056 139.9085 ob_102290 2022-04-21
5 53709 36.42056 139.9085 ob_100121 2022-03-24
6 53709 36.42056 139.9085 ob_102949 2022-05-09
Upvotes: 1
Reputation: 77367
Since some of the data is duplicated across multiple rows, I think its easier to write a python table and then convert that to a dataframe. Dataframes need to know the full number of rows when they are first created - so pandas would need to make an intermediate list anyway.
Its just a nested for loop that puts rows into a master list, duplicating the data from the outer loop in the inner loop. The data is pulled from the dictionaries via column name lists so we get the right order for each row.
I noticed that outbreakMap appears to be duplicated so skipped it and nationalObReference isn't in all of the data. I also put the code into a function which is an easy way to discard intermediate data that isn't needed after the calculation is done.
data = {0: {'id': '53545',
'latitude': -1.5168500000000003,
'longitude': 105.2985,
'outbreakMap': {'ob_102795': {'nationalObReference': '403/KPTS/05/22',
'oieReference': 'ob_102795',
'outbreakStartDate': '2022-04-12'},
'ob_102796': {'nationalObReference': '404/KPTS/05/22',
'oieReference': 'ob_102796',
'outbreakStartDate': '2022-04-22'}}},
1: {'id': '53709',
'latitude': 36.42056,
'longitude': 139.9085,
'outbreakMap': {'ob_101531': {'oieReference': 'ob_101531',
'outbreakStartDate': '2022-04-12'},
'ob_101644': {'oieReference': 'ob_101644',
'outbreakStartDate': '2022-04-14'},
'ob_102290': {'oieReference': 'ob_102290',
'outbreakStartDate': '2022-04-21'},
'ob_100121': {'oieReference': 'ob_100121',
'outbreakStartDate': '2022-03-24'},
'ob_102949': {'oieReference': 'ob_102949',
'outbreakStartDate': '2022-05-09'}}}}
import pandas as pd
def data_to_df(data):
common_cols = ["id", "latitude", "longitude"]
outbreak_cols = ["nationalObReference", "oieReference", "outbreakStartDate"]
table = []
for index, common in data.items():
common_row = [common[name] for name in common_cols]
for ob, ref in common["outbreakMap"].items():
row = common_row + [ref.get(name, None) for name in outbreak_cols]
table.append(row)
df = pd.DataFrame(table, columns=common_cols+outbreak_cols)
return df
df = data_to_df(data)
Upvotes: 1
Reputation:
While you can build the DataFrame and work on outbreakMap
column afterwards, it might be easier and more efficient if you modify data
to include the desired keys to the inner dicts, then construct the DataFrame and explode
the added columns:
for v in data.values():
outbreakMap = v.pop('outbreakMap')
v['outbreak'] = list(outbreakMap)
v['outbreak_information'] = list(outbreakMap.values())
out = pd.DataFrame.from_dict(data, orient='index').explode(['outbreak','outbreak_information'])
Output:
id latitude longitude outbreak outbreak_information
0 53545 -1.51685 105.2985 ob_102795 {'nationalObReference': '403/KPTS/05/22', 'oie...
0 53545 -1.51685 105.2985 ob_102796 {'nationalObReference': '404/KPTS/05/22', 'oie...
1 53709 36.42056 139.9085 ob_101531 {'oieReference': 'ob_101531', 'outbreakStartDa...
1 53709 36.42056 139.9085 ob_101644 {'oieReference': 'ob_101644', 'outbreakStartDa...
1 53709 36.42056 139.9085 ob_102290 {'oieReference': 'ob_102290', 'outbreakStartDa...
1 53709 36.42056 139.9085 ob_100121 {'oieReference': 'ob_100121', 'outbreakStartDa...
1 53709 36.42056 139.9085 ob_102949 {'oieReference': 'ob_102949', 'outbreakStartDa...
Upvotes: 1
Reputation: 526
[EDIT]
I think this is what you want.
def into_long(outbreakMap):
'''
Extracting key, oieReference, and outbreakStartDate
from outbreakMap
'''
keys = outbreakMap.keys()
# return list of tuples
return [(
key,
outbreakMap[key]['oieReference'],
outbreakMap[key]['outbreakStartDate']
) for key in keys]
df['outbreakMap_tuple'] = df['outbreakMap'].apply(lambda x: into_long(x))
# list to multiple rows
# further reading: https://stackoverflow.com/questions/39954668/how-to-convert-column-with-list-of-values-into-rows-in-pandas-dataframe
df = df.explode('outbreakMap_tuple')
# tuples to multiple columns
# further reading: https://stackoverflow.com/questions/29550414/how-can-i-split-a-column-of-tuples-in-a-pandas-dataframe
df[['outbreakMap', 'oieReference', 'outbreakStartDate']] = pd.DataFrame(df['outbreakMap_tuple'].tolist(), index=df.index)
df.drop('outbreakMap_tuple', axis=1, inplace=True)
# this is optional
df.reset_index(drop=True, inplace=True)
The outcome looks like
I am sure that there are better ways. This is just one of them.
Upvotes: 2