Reputation: 1
I have some json with multiple blocks with this format (using only one here to make it simple, so in this example the dataframe would have only one line):
{
"A": 1,
"B": {
"C": [
{
"D": 2,
"E": 3
}
],
"F": {
"G": 4,
"H": 5
}
}
}
And I want to create a DataFrame like this:
A B.C.D B.C.E B.F.G B.F.H
1 1 2 3 4 5
When I try to do
with open('some.json') as file:
data = json.load(file)
df = pd.json_normalize(data)
I get something like this:
A B.C B.F.G B.F.H
1 1 [{"D":2,"E":3}] 4 5
So... I can get the column B.C, break it into the B.C.D and B.C.E
df2 = pd.DataFrame(df['B.C'].tolist())
df3 = df2[0].apply(pd.Series) #The [0] here is the only way to work when I have more than one block in the json
Them later concatenate with the previous dataframe (and removing the B.C column) but it feels ugly and since I'm doing this a LOT I was thinking if there's a cleaner/faster way.
Well, thanks in advance!
Upvotes: 0
Views: 411
Reputation: 7594
You should checkout flatten JSON. It's the best way to flatten JSON's with multiple record paths
import flatten_json
with open('1.json', 'r+') as f:
data = json.load(f)
dic_flattened = flatten_json.flatten(data)
df = pd.json_normalize(dic_flattened)
print(df)
A B_C_0_D B_C_0_E B_F_G B_F_H
0 1 2 3 4 5
Upvotes: 1
Reputation: 15872
I guess you could write a recursive solution to preprocess the data. There might be some existing inbuilt solution but I am not aware of it. You can test the performance of the following:
def clean(data):
clean_data = {}
def parse(data, key=''):
if isinstance(data, list):
for elem in data:
parse(elem,key=key)
else:
for k, v in data.items():
if isinstance(v, (dict, list)):
parse(v, key=key+'.'+k)
else:
clean_data[(key+'.'+k).lstrip('.')] = v
parse(data)
return [clean_data]
data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}}
print(pd.DataFrame(clean(data)))
Output:
A B.C.D B.C.E B.F.G B.F.H
0 1 2 3 4 5
Upvotes: 2
Reputation: 43169
Write yourself a recursive function:
def get_values(iterable, root=""):
if isinstance(iterable, dict):
for key, values in iterable.items():
if isinstance(values, (dict, list)):
new_root = "{}.{}".format(root, key) if root else key
yield from get_values(values, new_root)
else:
absolute_key = "{}.{}".format(root, key) if root else key
yield absolute_key, values
elif isinstance(iterable, list):
for dct in iterable:
yield from get_values(dct, root)
result = [item for item in get_values(data)]
print(result)
Which yields
[('A', 1), ('B.C.D', 2), ('B.C.E', 3), ('B.F.G', 4), ('B.F.H', 5)]
To transform it into a DataFrame
, use:
result = dict([item for item in get_values(data)])
import pandas as pd
df = pd.DataFrame(result, index=[0])
print(df)
Which then yields
A B.C.D B.C.E B.F.G B.F.H
0 1 2 3 4 5
Upvotes: 1