Reputation: 534
Here is the example JSON:
{
"ApartmentBuilding":{
"Address":{
"HouseNumber": 5,
"Street": "DataStreet",
"ZipCode": 5100
},
"Apartments":[
{
"Number": 1,
"Price": 500,
"Residents": [
{
"Name": "Bob",
"Age": 43
},
{
"Name": "Alice",
"Age": 42
}
]
},
{
"Number": 2,
"Price": 750,
"Residents": [
{
"Name": "Jane",
"Age": 43
},
{
"Name": "William",
"Age": 42
}
]
},
{
"Number": 3,
"Price": 1000,
"Residents": []
}
]
}
}
I used below function from : Python Pandas - Flatten Nested JSON
import json
import pandas as pd
def flatten_json(nested_json: dict, exclude: list=['']) -> dict:
"""
Flatten a list of nested dicts.
"""
out = dict()
def flatten(x: (list, dict, str), name: str='', exclude=exclude):
if type(x) is dict:
for a in x:
if a not in exclude:
flatten(x[a], f'{name}{a}_')
elif type(x) is list:
i = 0
for a in x:
flatten(a, f'{name}{i}_')
i += 1
else:
out[name[:-1]] = x
flatten(nested_json)
return out
with open("apartments.json") as f:
data = json.load(f)
print(data)
df = pd.DataFrame([flatten_json(x) for x in data['ApartmentBuilding']])
print(df)
I am aiming to completely flatten the JSON to be transformed into a Panda Dataframe however, get a strange output as belows:
0 Address
1 Apartments
I am basically after something flatten like this:
Upvotes: 1
Views: 229
Reputation: 13478
Here is another way to do it using Pandas json_normalize and explode:
import json
import pandas as pd
with open("file.json") as f:
data = json.load(f)
df = pd.json_normalize(data["ApartmentBuilding"])
First, get and flatten the different parts of the dataframe:
building = df.explode("Apartments").reset_index(drop=True)
apartments = pd.DataFrame(building["Apartments"].to_dict()).T.explode("Residents")
residents = pd.DataFrame(
apartments["Residents"].dropna().reset_index(drop=True).to_list()
)
Then, build the dataframe back by merging and concatenating the flattened parts:
new_df = pd.merge(
left=building.loc[:, building.columns != "Apartments"],
right=apartments.loc[:, apartments.columns != "Residents"],
right_index=True,
left_index=True,
).reset_index(drop=True)
new_df = pd.concat([new_df, residents], axis=1)
Do a little cleanup:
new_df.columns = [col.replace("Address.", "") for col in new_df.columns]
And finally:
print(new_df)
# Output
HouseNumber Street ZipCode Number Price Name Age
0 5 DataStreet 5100 1 500 Bob 43.0
1 5 DataStreet 5100 1 500 Alice 42.0
2 5 DataStreet 5100 2 750 Jane 43.0
3 5 DataStreet 5100 2 750 William 42.0
4 5 DataStreet 5100 3 1000 NaN NaN
Upvotes: 2