Reputation: 33
I have found a numerous number of similar questions on stackoverflow, however, one issue remains unsolved to me. I have a heavily nested “.json” file I need to import and convert into R or Python data.frame to work with. Json file contains lists inside (usually empty but sometime contains data). Example of json's structure:
I use R's library jsonlite
and Python's pandas
.
# R
jsonlite::fromJSON(json_file, flatten = TRUE)
# or
jsonlite::read_json(json_file, simplifyVector = T)
# Python
with open(json_file.json, encoding = "utf-8") as f:
data = json.load(f)
pd.json_normalize(data)
Generally, in both cases it work. The output looks like a normal data.frame, however, the problem is that some columns of a new data.frame contain an embedded lists (I am not sure about "embedded lists" whether it's correct and clear). Seems that both Pandas and jsonlite combined each list into single column, which is clearly seen in the screens below.
R
Python
As you might see some columns, such as wymagania.wymaganiaKonieczne.wyksztalcenia is nothing but a vector contains a combined/embedded list, i.e. content of a list has been combined into single column.
As a desired output I want to split each element of such lists as a single column of a data.frame. In other words, I want to obtain normal “in tidy sense” data.frame without any nested either data.frames and lists. Both R and Python codes are appreciated.
Minimum reproducible example:
[
{
"warunkiPracyIPlacy":{"miejscePracy":"abc","rodzajObowiazkow":"abc","zakresObowiazkow":"abc","rodzajZatrudnienia":"abc","kodRodzajuZatrudnienia":"abc","zmianowosc":"abc"},
"wymagania":{
"wymaganiaKonieczne":{
"zawody":[],
"wyksztalcenia":["abc"],
"wyksztalceniaSzczegoly":[{"kodPoziomuWyksztalcenia":"RPs002|WY","kodTypuWyksztalcenia":"abc"}],
"jezyki":[],
"jezykiSzczegoly":[],
"uprawnienia":[]},
"wymaganiaPozadane":{
"zawody":[],
"zawodySzczegoly":[],
"staze":[]},
"wymaganiaDodatkowe":{"zawody":[],"zawodySzczegoly":[]},
"inneWymagania":"abc"
},
"danePracodawcy":{"pracodawca":"abc","nip":"abc","regon":"abc","branza":null},
"pozostaleDane":{"identyfikatorOferty":"abc","ofertaZgloszonaPrzez":"abc","ofertaZgloszonaPrzezKodJednostki":"abc"},
"typOferty":"abc",
"typOfertyNaglowek":"abc",
"rodzajOferty":["DLA_ZAREJESTROWANYCH"],"staz":false,"link":false}
]
Upvotes: 1
Views: 1067
Reputation: 369
This is an answer for Python. It is not very elegant, but I think it will do for your purpose.
I have called your example file nested_json.json
import json
import pandas as pd
json_file = "nested_json.json"
with open(json_file, encoding="utf-8") as f:
data = json.load(f)
df = pd.json_normalize(data)
df_exploded = df.apply(lambda x: x.explode()).reset_index(drop=True)
# check based on first row whether its of type dict
columns_dict = df_exploded.columns[df_exploded.apply(lambda x: isinstance(x[0], dict))]
# append the splitted dict to the dataframe
for col in columns_dict:
df_splitted_dict = df_exploded[col].apply(pd.Series)
df_exploded = pd.concat([df_exploded, df_splitted_dict], axis=1)
This leads to a rectangular dataframe
>>> df_exploded.T
0
typOferty abc
typOfertyNaglowek abc
rodzajOferty DLA_ZAREJESTROWANYCH
staz False
link False
warunkiPracyIPlacy.miejscePracy abc
warunkiPracyIPlacy.rodzajObowiazkow abc
warunkiPracyIPlacy.zakresObowiazkow abc
warunkiPracyIPlacy.rodzajZatrudnienia abc
warunkiPracyIPlacy.kodRodzajuZatrudnienia abc
warunkiPracyIPlacy.zmianowosc abc
wymagania.wymaganiaKonieczne.zawody NaN
wymagania.wymaganiaKonieczne.wyksztalcenia abc
wymagania.wymaganiaKonieczne.wyksztalceniaSzcze... {'kodPoziomuWyksztalcenia': 'RPs002|WY', 'kodT...
wymagania.wymaganiaKonieczne.jezyki NaN
wymagania.wymaganiaKonieczne.jezykiSzczegoly NaN
wymagania.wymaganiaKonieczne.uprawnienia NaN
wymagania.wymaganiaPozadane.zawody NaN
wymagania.wymaganiaPozadane.zawodySzczegoly NaN
wymagania.wymaganiaPozadane.staze NaN
wymagania.wymaganiaDodatkowe.zawody NaN
wymagania.wymaganiaDodatkowe.zawodySzczegoly NaN
wymagania.inneWymagania abc
danePracodawcy.pracodawca abc
danePracodawcy.nip abc
danePracodawcy.regon abc
danePracodawcy.branza None
pozostaleDane.identyfikatorOferty abc
pozostaleDane.ofertaZgloszonaPrzez abc
pozostaleDane.ofertaZgloszonaPrzezKodJednostki abc
kodPoziomuWyksztalcenia RPs002|WY
kodTypuWyksztalcenia abc
Upvotes: 2