Reputation: 71
I am dealing with a DataFrame (imported from .csv) that I want to convert into a nested JSON, but I am not able to created an additional nested level. I am trying to explain with an example. At the very end of the work the result is imported into MongoDB with pymongo.
----------------------------------------------------
worker_id | gender | employer_id | year | job_type |
----------------------------------------------------
WORK_1 | M | EMPL_2 | 1990 | Att |
----------------------------------------------------
WORK_1 | M | EMPL_1 | 1991 | Mis |
----------------------------------------------------
WORK_1 | M | EMPL_1 | 1993 | Att |
----------------------------------------------------
WORK_2 | F | EMPL_3 | 1995 | Att |
----------------------------------------------------
WORK_2 | F | EMPL_3 | 1992 | Mis |
----------------------------------------------------
WORK_2 | F | EMPL_3 | 1994 | Att |
----------------------------------------------------
df = pd.DataFrame({'worker_id':['WORK_1','WORK_1','WORK_1','WORK_2','WORK_2','WORK_2'],
'gender':['M','M','M','F','F','F'],
'employer_id':['EMPL_2','EMPL_1','EMPL_1','EMPL_3','EMPL_3','EMPL_3'],
'year':[1990,1991,1993,1995,1992,1994],
'job_type':['Att','Mis','Att','Att','Mis','Att']
})
The JSON that I want to obtain should follow a structure like the following one:
{ "worker_id": "WORK_1",
"gender": "M",
"job_type" : [
{ "Att": [
{
"employer_id": "EMPL_1",
"year": 1990
},
{
"employer_id": "EMPL_2",
"year": 1993
}
]
},
{ "Mis": [
{
"employer_id": "EMPL_1",
"year": 1991
}
]
}
]
},
{ "worker_id": "WORK_2",
"gender": "F",
"job_type" : [
{ "Att": [
{
"employer_id": "EMPL_3",
"year": 1994
},
{
"employer_id": "EMPL_3",
"year": 1995
}
]
},
{ "Mis": [
{
"employer_id": "EMPL_3",
"year": 1992
}
]
}
]
}
I was able to nest within the category (array) of 'job_type' an object for each specific work contract (each row of the sample should represent a specific work contract, followed by several other variables) by following several helpful discussion here on Stack Overflow. Nonetheless, I want to differentiate also between kind of works (in the example between 'Mis' and 'Att'), then create another nested level.
The code that I used to nest among job contract between workers is the following one.
finalList = []
finalDict = {}
grouped = df.groupby(['worker_id',
'gender'
])
for key, value in grouped:
dictionary = {}
j = grouped.get_group(key).reset_index(drop = True)
dictionary['worker_id'] = j.at[0, 'worker_id']
dictionary['gender'] = j.at[0, 'gender']
dictList = []
anotherDict = {}
for i in j.index:
anotherDict['employer_id'] = j.at[i, 'employer_id']
anotherDict['year'] = j.at[i, 'year']
anotherDict['job_type'] = j.at[i, 'job_type']
dictList.append(anotherDict.copy())
dictionary['job_type'] = dictList
finalList.append(dictionary)
I hope someone could kindly help me. Thank you in advance!
Update
I tried to enhance the code with the script here below (I followed this thread). Unfortunately, I am still not getting what I want.
# Generates a column for each kind of 'job_type'
df['att'] = ['Att' if x == 'Att' else None for x in df['job_type']]
df['mis'] = ['Mis' if x == 'Mis' else None for x in df['job_type']]
# Aggregate for the 'job_type' = 'Mis'
df_att = df.dropna(subset = ['att'])
df_att.drop(columns=['mis'])
att = (df_att.groupby(['worker_id','gender'], as_index = True)
.apply(lambda x: x[['employer_id','year','job_type']].to_dict('r'))
.reset_index()
.rename(columns = {0:'Att'}))
# Aggregate for the 'job_type' = 'Som'
df_mis= df.dropna(subset = ['mis'])
df_mis.drop(columns=['att'])
mis = (df_mis.groupby(['worker_id','gender'], as_index = False)
.apply(lambda x: x[['employer_id','year','job_type']].to_dict('r'))
.reset_index()
.rename(columns = {0:'Mis'}))
# Append
df_all = att.append(mis)
# Aggregate for 'worker_id' and 'gender'
j = (df_all.groupby(['worker_id','gender'], as_index = False)
.apply(lambda x: x[['Att','Mis']].to_dict('r'))
.reset_index()
.rename(columns = {0:'job_type'})
.to_json(orient = 'records'))
print(json.dumps(json.loads(j), indent = 4, sort_keys = True))
What I am getting so far...
[
{
"gender": "M",
"job_type": [
{
"Att": [
{
"employer_id": "EMPL_2",
"job_type": "Att",
"year": 1990
},
{
"employer_id": "EMPL_1",
"job_type": "Att",
"year": 1993
}
],
"Mis": null
},
{
"Att": null,
"Mis": [
{
"employer_id": "EMPL_1",
"job_type": "Mis",
"year": 1991
}
]
}
],
"worker_id": "WORK_1"
},
{
"gender": "F",
"job_type": [
{
"Att": [
{
"employer_id": "EMPL_3",
"job_type": "Att",
"year": 1995
},
{
"employer_id": "EMPL_3",
"job_type": "Att",
"year": 1994
}
],
"Mis": null
},
{
"Att": null,
"Mis": [
{
"employer_id": "EMPL_3",
"job_type": "Mis",
"year": 1992
}
]
}
],
"worker_id": "WORK_2"
}
]
Upvotes: 0
Views: 1037
Reputation: 1322
Here's a solution that loops over the unique worker_id
values and builds up a list of dictionaries for each worker_id
:
import pandas as pd
import json
df = pd.DataFrame({'worker_id':['WORK_1','WORK_1','WORK_1','WORK_2','WORK_2','WORK_2'],
'gender':['M','M','M','F','F','F'],
'employer_id':['EMPL_2','EMPL_1','EMPL_1','EMPL_3','EMPL_3','EMPL_3'],
'year':[1990,1991,1993,1995,1992,1994],
'job_type':['Att','Mis','Att','Att','Mis','Att']})
df_G=df[['worker_id','gender']].drop_duplicates()
all_dicts=[]
for indx,vals in df_G.iterrows():
this_dict=vals.to_dict()
job_dict=(df[df.worker_id==vals['worker_id']]
.groupby(['job_type']).apply(lambda x: x[['employer_id','year']]
.to_dict('r')).to_dict())
this_dict['job_type']=[]
for key,val in job_dict.items():
print({key:val})
this_dict['job_type'].append({key:val})
all_dicts.append(this_dict)
(df[df.worker_id==vals['worker_id']].groupby(['job_type']).apply(lambda x: x[['employer_id','year']].to_dict('r')))
print(json.dumps(all_dicts, indent = 4, sort_keys = True))
Prints out:
[
{
"gender": "M",
"job_type": [
{
"Mis": [
{
"employer_id": "EMPL_1",
"year": 1991
}
]
},
{
"Att": [
{
"employer_id": "EMPL_2",
"year": 1990
},
{
"employer_id": "EMPL_1",
"year": 1993
}
]
}
],
"worker_id": "WORK_1"
},
{
"gender": "F",
"job_type": [
{
"Mis": [
{
"employer_id": "EMPL_3",
"year": 1992
}
]
},
{
"Att": [
{
"employer_id": "EMPL_3",
"year": 1995
},
{
"employer_id": "EMPL_3",
"year": 1994
}
]
}
],
"worker_id": "WORK_2"
}
]
Perhaps not the most efficient or pythonic, but it works. And if I remember pymongo correctly you can pass it the list of dicts to insert.
Upvotes: 1