I am trying to flatten this nested dictionary with json_normalize and have hit a roadblock.
Flattening an individual entry works, but when I try to flatten all of the data at once it raises KeyError: '0'.
I'm not sure how to fix this.
Example of the data:
data = {1:{
'Name': "Thrilling Tales of Dragon Slayers",
'IDs':{
"StoreID": ['123445452543'],
"BookID": ['543533254353'],
"SalesID": ['543267765345']}},
2:{
'Name': "boring Tales of Dragon Slayers",
'IDs':{
"StoreID": ['111111', '1121111'],
"BookID": ['543533254353', '4324232342'],
"SalesID": ['543267765345', '4353543']}}}
My code:
d_flat = pd.io.json.json_normalize(data, meta=['Title', 'StoreID', 'BookID', 'SalesID'])
Upvotes: 2
Views: 2368
Reputation: 62383
Use pandas.DataFrame.from_dict to read data.
Convert the 'IDs' column to separate columns:
.pop removes the old column from df.
pd.DataFrame(df.pop('IDs').values.tolist()) converts each dict key to a separate column.
.join adds the new columns back to df.
Use pd.Series.explode on each list in the columns, with .apply.
import pandas as pd
# test data
data =\
{1: {'IDs': {'BookID': ['543533254353'],
'SalesID': ['543267765345'],
'StoreID': ['123445452543']},
'Name': 'Thrilling Tales of Dragon Slayers'},
2: {'IDs': {'BookID': ['543533254353', '4324232342'],
'SalesID': ['543267765345', '4353543'],
'StoreID': ['111111', '1121111']},
'Name': 'boring Tales of Dragon Slayers'}}
# load the data using from_dict
df = pd.DataFrame.from_dict(data, orient='index').reset_index(drop=True)
# convert IDs to separate columns
df = df.join(pd.DataFrame(df.pop('IDs').values.tolist()))
# explode the list in each column
df = df.apply(pd.Series.explode).reset_index(drop=True)
# display(df)
                                Name        BookID       SalesID       StoreID
0  Thrilling Tales of Dragon Slayers  543533254353  543267765345  123445452543
1     boring Tales of Dragon Slayers  543533254353  543267765345        111111
2     boring Tales of Dragon Slayers    4324232342       4353543       1121111
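As a possible alternative to applying pd.Series.explode column by column, recent pandas versions (1.3 and later, as far as I know) let DataFrame.explode take a list of columns. The sketch below assumes the 'IDs' dict has already been split into list-valued columns, as in the steps above, and that the lists within each row have the same length. The name flat is just for illustration.

```python
import pandas as pd

# The frame as it looks after 'IDs' has been split into separate list columns
df = pd.DataFrame({
    "Name": ["Thrilling Tales of Dragon Slayers", "boring Tales of Dragon Slayers"],
    "BookID": [["543533254353"], ["543533254353", "4324232342"]],
    "SalesID": [["543267765345"], ["543267765345", "4353543"]],
    "StoreID": [["123445452543"], ["111111", "1121111"]],
})

# Explode the three list columns together, one output row per list position.
# Assumption: within each row, the lists have matching lengths.
flat = df.explode(["BookID", "SalesID", "StoreID"], ignore_index=True)
print(flat)
```

If the lists in a row can differ in length, this call is expected to raise an error rather than pad, so the data would need to be padded first.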
Upvotes: 4
Reputation: 294218
Your data is structured inconveniently. I want to focus on turning the lists under 'IDs' into a list of dictionaries, which would be far more convenient.
Your data:
{1: {'Name': 'Thrilling Tales of Dragon Slayers',
'IDs': {'StoreID': ['123445452543'],
'BookID': ['543533254353'],
'SalesID': ['543267765345']}},
2: {'Name': 'boring Tales of Dragon Slayers',
'IDs': {'StoreID': ['111111', '1121111'],
'BookID': ['543533254353', '4324232342'],
'SalesID': ['543267765345', '4353543']}}}
What I want it to look like:
[{'Name': 'Thrilling Tales of Dragon Slayers',
'IDs': [{'StoreID': '123445452543',
'BookID': '543533254353',
'SalesID': '543267765345'}]},
{'Name': 'boring Tales of Dragon Slayers',
'IDs': [{'StoreID': '111111',
'BookID': '543533254353',
'SalesID': '543267765345'},
{'StoreID': '1121111',
'BookID': '4324232342',
'SalesID': '4353543'}]}]
Simple loop, don't mess around. This gets us what I showed above:
new = []
for v in data.values():
    temp = {**v}  # This is intended to keep all the other data that might be there
    ids = temp.pop('IDs')  # I have to focus on this to create the records
    temp['IDs'] = [dict(zip(ids, x)) for x in zip(*ids.values())]
    new.append(temp)
The same thing as a one-line comprehension:
new = [{**v, 'IDs': [dict(zip(v['IDs'], x)) for x in zip(*v['IDs'].values())]} for v in data.values()]
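To make the zip trick in that restructuring step easier to follow, here is a minimal pure-Python sketch on a single entry's 'IDs' value. The names ids, rows, and records are only for illustration and are not part of the answer's code.

```python
# One entry's 'IDs' value, taken from the question's data
ids = {
    "StoreID": ["111111", "1121111"],
    "BookID": ["543533254353", "4324232342"],
    "SalesID": ["543267765345", "4353543"],
}

# zip(*ids.values()) transposes the per-key lists into one tuple per record
rows = list(zip(*ids.values()))

# Iterating a dict yields its keys, so zip(ids, row) pairs each key with
# the value at the same position in the tuple
records = [dict(zip(ids, row)) for row in rows]
print(records)
```

This relies on dicts preserving insertion order (Python 3.7+), so keys and values line up.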
Creating a DataFrame with pd.json_normalize
In this call to json_normalize we need to specify the path to the records, i.e. the list of ID dictionaries found at the 'IDs' key. json_normalize will create one row in the dataframe for every item in that list. This is done with the record_path parameter, which takes a list of keys describing the path (if the records were in a deeper structure) or a string (if the key is at the top level, which for us it is).
record_path = 'IDs'
Then we want to tell json_normalize which keys are metadata for the records. If there is more than one record, as we have, the metadata will be repeated for each record.
meta = 'Name'
So the final solution looks like this:
pd.json_normalize(new, record_path='IDs', meta='Name')
        StoreID        BookID       SalesID                               Name
0  123445452543  543533254353  543267765345  Thrilling Tales of Dragon Slayers
1        111111  543533254353  543267765345     boring Tales of Dragon Slayers
2       1121111    4324232342       4353543     boring Tales of Dragon Slayers
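For the deeper-structure case mentioned in the record_path explanation, a sketch might look like the following. The 'details' wrapper key and the names nested and df_nested are hypothetical and do not appear in the question's data; they only illustrate passing record_path as a list of keys.

```python
import pandas as pd

# Hypothetical variant of the restructured data: the records sit one level
# deeper, under an invented 'details' key
nested = [
    {"Name": "Thrilling Tales of Dragon Slayers",
     "details": {"IDs": [{"StoreID": "123445452543", "BookID": "543533254353"}]}},
    {"Name": "boring Tales of Dragon Slayers",
     "details": {"IDs": [{"StoreID": "111111", "BookID": "543533254353"},
                         {"StoreID": "1121111", "BookID": "4324232342"}]}},
]

# record_path as a list walks details -> IDs; 'Name' is repeated per record
df_nested = pd.json_normalize(nested, record_path=["details", "IDs"], meta="Name")
print(df_nested)
```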
If we are restructuring anyway, might as well make it so we can just pass it to the dataframe constructor.
pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip(*r['IDs'].values())
])
                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543
While we are at it: the data is ambiguous as to whether each ID type always has the same number of IDs. Suppose they do not.
data = {1:{
'Name': "Thrilling Tales of Dragon Slayers",
'IDs':{
"StoreID": ['123445452543'],
"BookID": ['543533254353'],
"SalesID": ['543267765345']}},
2:{
'Name': "boring Tales of Dragon Slayers",
'IDs':{
"StoreID": ['111111', '1121111'],
"BookID": ['543533254353', '4324232342'],
"SalesID": ['543267765345', '4353543', 'extra id']}}}
Then we can use zip_longest from itertools:
from itertools import zip_longest
pd.DataFrame([
    {'Name': r['Name'], **dict(zip(r['IDs'], x))}
    for r in data.values() for x in zip_longest(*r['IDs'].values())
])
                                Name       StoreID        BookID       SalesID
0  Thrilling Tales of Dragon Slayers  123445452543  543533254353  543267765345
1     boring Tales of Dragon Slayers        111111  543533254353  543267765345
2     boring Tales of Dragon Slayers       1121111    4324232342       4353543
3     boring Tales of Dragon Slayers          None          None      extra id
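The difference between zip and zip_longest is easy to miss, so here is a small standard-library sketch of it. The list names and the empty-string fillvalue are only illustrative choices, not something the data requires.

```python
from itertools import zip_longest

store_ids = ["111111", "1121111"]
sales_ids = ["543267765345", "4353543", "extra id"]

# zip stops at the shortest input, silently dropping 'extra id'
truncated = list(zip(store_ids, sales_ids))

# zip_longest keeps going and pads the shorter input with None
padded = list(zip_longest(store_ids, sales_ids))

# fillvalue swaps None for a placeholder of your choice
padded_custom = list(zip_longest(store_ids, sales_ids, fillvalue=""))

print(truncated)
print(padded)
print(padded_custom)
```

With plain zip the extra SalesID would vanish without any warning, which is why zip_longest is the safer choice when the list lengths are not guaranteed to match.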
Upvotes: 3