Reputation: 7818
I was given a pretty big json file that looks like this minimal example:
json_file = """
{
"products":
[
{
"id":"0",
"name": "First",
"emptylist":[],
"properties" :
{
"id" : "",
"name" : ""
}
},
{
"id":"1",
"name": "Second",
"emptylist":[],
"properties":
{
"id" : "23",
"name" : "a useful product",
"features" :
[
{
"name":"Features",
"id":"18",
"features":
[
{
"id":"1001",
"name":"Colour",
"value":"Black"
},
{
"id":"2093",
"name":"Material",
"value":"Plastic"
}
]
},
{
"name":"Sizes",
"id":"34",
"features":
[
{
"id":"4736",
"name":"Length",
"value":"56"
},
{
"id":"8745",
"name":"Width",
"value":"76"
}
]
}
]
}
},
{
"id":"2",
"name": "Third",
"properties" :
{
"id" : "876",
"name" : "another one",
"features" :
[
{
"name":"Box",
"id":"937",
"features":
[
{
"id":"3758",
"name":"Amount",
"value":"1"
},
{
"id":"2222",
"name":"Packaging",
"value":"Blister"
}
]
},
{
"name":"Features",
"id":"8473",
"features":
[
{
"id":"9372",
"name":"Colour",
"value":"White"
},
{
"id":"9375",
"name":"Position",
"value":"A"
},
{
"id":"2654",
"name":"Amount",
"value":"6"
}
]
}
]
}
}
]
}
"""
And I want to make a flat table out of it. It should look like this one:
id name emptylist properties.id properties.name properties.features.name properties.features.id properties.features.features.id properties.features.features.name properties.features.features.value
0 First [] "" "" NaN NaN NaN NaN NaN
1 Second [] "23" "a useful product" Features 18 1001 Colour Black
1 Second [] "23" "a useful product" Features 18 2093 Material Plastic
1 Second [] "23" "a useful product" Sizes 34 4736 Length 56
1 Second [] "23" "a useful product" Sizes 34 8745 Width 76
2 Third "876" "another one" Box 937 3758 Amount 1
2 Third "876" "another one" Box 937 2222 Packaging Blister
2 Third "876" "another one" Features 8473 9372 Colour White
2 Third "876" "another one" Features 8473 9375 Position A
2 Third "876" "another one" Features 8473 2654 Amount 6
I tried this:
import pandas as pd
import json
j = json.loads(json_file)
df = pd.json_normalize(j['products'])
df
id name emptylist properties.id properties.name properties.features
0 0 First [] NaN
1 1 Second [] 23 a useful product [{'name': 'Features', 'id': '18', 'features': ...
2 2 Third NaN 876 another one [{'name': 'Box', 'id': '937', 'features': [{'i...
And I tried to play a bit with the additional arguments, but I got nowhere. It seems like this is not the right way.
Can anyone help me?
I got a working solution with R, but I need to be able to do it with Python. If it helps, this would be the R code that I am trying to translate in Python.
library(tidyr)
jsonlite::fromJSON(json_file)$products %>%
jsonlite::flatten() %>%
unnest(properties.features , names_sep = ".", keep_empty = TRUE) %>%
unnest(properties.features.features, names_sep = ".", keep_empty = TRUE)
With the help of @piterbarg and some research I got to this solution:
j = json.loads(json_file)
df = pd.json_normalize(j['products'])
df1 = df.explode('properties.features')
df2 = pd.concat([df1.reset_index(drop=True).drop('properties.features', axis = 1),
df1['properties.features'].apply(pd.Series).reset_index(drop=True).add_prefix("properties.features.").drop("properties.features.0", axis = 1)], axis = 1)
df2 = df2.explode('properties.features.features')
df3 = pd.concat([df2.reset_index(drop=True).drop('properties.features.features', axis = 1),
df2['properties.features.features'].apply(pd.Series).reset_index(drop=True).add_prefix("properties.features.features.").drop("properties.features.features.0", axis = 1)], axis = 1)
df3
With this I get exactly the solution I'm looking for but the code looks pretty messy and I'm not sure how efficient this solution may be. Any help?
Upvotes: 1
Views: 408
Reputation: 9308
It is similar to what you have in Edit, but perhaps slightly shorter syntax and more performant.
If you have NaN in the DataFrame, older version of Pandas could fail on json_normalize
.
This solution should work with Pandas 1.3+.
df = pd.json_normalize(products)
df = df.explode('properties.features')
df = pd.concat([df.drop('properties.features', axis=1).reset_index(drop=True),
pd.json_normalize(df['properties.features']).add_prefix('properties.features.')], axis=1)
df = df.explode('properties.features.features')
df = pd.concat([df.drop('properties.features.features', axis=1).reset_index(drop=True),
pd.json_normalize(df['properties.features.features']).add_prefix('properties.features.features.')], axis=1)
Perf. with 1000 products.
Code in Edit : 4.85 s ± 218 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This solution: 58.3 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1
Reputation: 8219
This can be done with a repeated, if somewhat tedious, application of explode
to expand lists and apply(pd.Series)
to expand dicts:
df1 = df.explode('properties.features')
df2 = df1.join(df1['properties.features'].apply(pd.Series), lsuffix = '', rsuffix = '.properties.features').explode('features').drop(columns = 'properties.features')
df3 = df2.join(df2['features'].apply(pd.Series), lsuffix = '', rsuffix='.features').drop(columns = ['features','emptylist']).drop_duplicates()
df3
looks like this:
id name properties.id properties.name 0 id.properties.features name.properties.features 0.features id.features name.features value
-- ---- ------ --------------- ----------------- --- ------------------------ -------------------------- ------------ ------------- --------------- -------
0 0 First nan nan nan nan nan nan nan
1 1 Second 23 a useful product nan 18 Features nan 1001 Colour Black
1 1 Second 23 a useful product nan 18 Features nan 2093 Material Plastic
1 1 Second 23 a useful product nan 18 Features nan 4736 Length 56
1 1 Second 23 a useful product nan 18 Features nan 8745 Width 76
1 1 Second 23 a useful product nan 34 Sizes nan 1001 Colour Black
1 1 Second 23 a useful product nan 34 Sizes nan 2093 Material Plastic
1 1 Second 23 a useful product nan 34 Sizes nan 4736 Length 56
1 1 Second 23 a useful product nan 34 Sizes nan 8745 Width 76
2 2 Third 876 another one nan 937 Box nan 3758 Amount 1
2 2 Third 876 another one nan 937 Box nan 2222 Packaging Blister
2 2 Third 876 another one nan 937 Box nan 9372 Colour White
2 2 Third 876 another one nan 937 Box nan 9375 Position A
2 2 Third 876 another one nan 937 Box nan 2654 Amount 6
2 2 Third 876 another one nan 8473 Features nan 3758 Amount 1
2 2 Third 876 another one nan 8473 Features nan 2222 Packaging Blister
2 2 Third 876 another one nan 8473 Features nan 9372 Colour White
2 2 Third 876 another one nan 8473 Features nan 9375 Position A
2 2 Third 876 another one nan 8473 Features nan 2654 Amount 6
The names are not quite like you want them, this could be fixed with .rename(columns = {...})
if you want
Upvotes: 0