Nicola Caravaggio

Reputation: 71

Convert Pandas DataFrame into multiple nested JSON

I am dealing with a DataFrame (imported from a .csv) that I want to convert into a nested JSON, but I am not able to create an additional nested level. Let me explain with an example. At the very end of the process the result is imported into MongoDB with pymongo.

worker_id | gender | employer_id | year | job_type
----------|--------|-------------|------|---------
WORK_1    | M      | EMPL_2      | 1990 | Att
WORK_1    | M      | EMPL_1      | 1991 | Mis
WORK_1    | M      | EMPL_1      | 1993 | Att
WORK_2    | F      | EMPL_3      | 1995 | Att
WORK_2    | F      | EMPL_3      | 1992 | Mis
WORK_2    | F      | EMPL_3      | 1994 | Att

df = pd.DataFrame({'worker_id':['WORK_1','WORK_1','WORK_1','WORK_2','WORK_2','WORK_2'],
                    'gender':['M','M','M','F','F','F'],
                    'employer_id':['EMPL_2','EMPL_1','EMPL_1','EMPL_3','EMPL_3','EMPL_3'],
                    'year':[1990,1991,1993,1995,1992,1994],
                    'job_type':['Att','Mis','Att','Att','Mis','Att']
                    })

The JSON that I want to obtain should follow a structure like the following one:

{ "worker_id": "WORK_1",
  "gender": "M",
  "job_type" : [
               { "Att": [
                        { 
                          "employer_id": "EMPL_1",
                          "year": 1990
                        },
                        { 
                          "employer_id": "EMPL_2",
                          "year": 1993
                        }
                        ]      
               },      
               { "Mis": [
                        { 
                          "employer_id": "EMPL_1",
                          "year": 1991
                        }
                        ]      
               } 
               ]
},
{ "worker_id": "WORK_2",
  "gender": "F",
  "job_type" : [
               { "Att": [
                        { 
                          "employer_id": "EMPL_3",
                          "year": 1994
                        },
                        { 
                          "employer_id": "EMPL_3",
                          "year": 1995
                        }
                        ]      
               },      
               { "Mis": [
                        { 
                          "employer_id": "EMPL_3",
                          "year": 1992
                        }
                        ]      
               } 
               ]
}

By following several helpful discussions here on Stack Overflow, I was able to nest inside the 'job_type' array an object for each specific work contract (each row of the sample represents one work contract, followed by several other variables). Nonetheless, I also want to differentiate between kinds of work (in the example, between 'Mis' and 'Att'), i.e. create another nested level.

The code that I used to nest the work contracts under each worker is the following.

finalList = []

# One group per worker: the key is the (worker_id, gender) tuple
grouped = df.groupby(['worker_id',
                      'gender'
                     ])
for key, value in grouped:

    dictionary = {}

    j = grouped.get_group(key).reset_index(drop = True)
    dictionary['worker_id'] = j.at[0, 'worker_id']
    dictionary['gender'] = j.at[0, 'gender']

    # One dictionary per row, i.e. per work contract
    dictList = []
    anotherDict = {}

    for i in j.index:

        anotherDict['employer_id'] = j.at[i, 'employer_id']
        anotherDict['year'] = j.at[i, 'year']
        anotherDict['job_type'] = j.at[i, 'job_type']

        dictList.append(anotherDict.copy())

    dictionary['job_type'] = dictList

    finalList.append(dictionary)
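
For reference, this is roughly how I inspect finalList at the moment (a simplified sketch; the default=int is only there because .at returns numpy scalars for the 'year' column). It shows that, for now, every contract ends up in one flat list under 'job_type', with no 'Att'/'Mis' level:

import json

# Dump the current result: one flat 'job_type' list per worker,
# not yet split into 'Att' and 'Mis'
print(json.dumps(finalList, indent=4, default=int))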

I hope someone could kindly help me. Thank you in advance!

Update

I tried to enhance the code with the script below (I followed this thread). Unfortunately, I am still not getting what I want.

# Generate a column for each kind of 'job_type'
df['att'] = ['Att' if x == 'Att' else None for x in df['job_type']]
df['mis'] = ['Mis' if x == 'Mis' else None for x in df['job_type']]

# Aggregate for 'job_type' = 'Att'
df_att = df.dropna(subset = ['att'])
df_att = df_att.drop(columns = ['mis'])
att = (df_att.groupby(['worker_id','gender'], as_index = True)
             .apply(lambda x: x[['employer_id','year','job_type']].to_dict('records'))
             .reset_index()
             .rename(columns = {0:'Att'}))

# Aggregate for 'job_type' = 'Mis'
df_mis = df.dropna(subset = ['mis'])
df_mis = df_mis.drop(columns = ['att'])
mis = (df_mis.groupby(['worker_id','gender'], as_index = False)
             .apply(lambda x: x[['employer_id','year','job_type']].to_dict('records'))
             .reset_index()
             .rename(columns = {0:'Mis'}))

# Concatenate the two partial results
df_all = pd.concat([att, mis])

# Aggregate by 'worker_id' and 'gender'
j = (df_all.groupby(['worker_id','gender'], as_index = False)
           .apply(lambda x: x[['Att','Mis']].to_dict('records'))
           .reset_index()
           .rename(columns = {0:'job_type'})
           .to_json(orient = 'records'))

print(json.dumps(json.loads(j), indent = 4, sort_keys = True))

What I am getting so far...

[
    {
        "gender": "M",
        "job_type": [
            {
                "Att": [
                    {
                        "employer_id": "EMPL_2",
                        "job_type": "Att",
                        "year": 1990
                    },
                    {
                        "employer_id": "EMPL_1",
                        "job_type": "Att",
                        "year": 1993
                    }
                ],
                "Mis": null
            },
            {
                "Att": null,
                "Mis": [
                    {
                        "employer_id": "EMPL_1",
                        "job_type": "Mis",
                        "year": 1991
                    }
                ]
            }
        ],
        "worker_id": "WORK_1"
    },
    {
        "gender": "F",
        "job_type": [
            {
                "Att": [
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Att",
                        "year": 1995
                    },
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Att",
                        "year": 1994
                    }
                ],
                "Mis": null
            },
            {
                "Att": null,
                "Mis": [
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Mis",
                        "year": 1992
                    }
                ]
            }
        ],
        "worker_id": "WORK_2"
    }
]

Upvotes: 0

Views: 1037

Answers (1)

chris

Reputation: 1322

Here's a solution that loops over the unique worker_id values, builds a dictionary for each worker, and collects them in a list:

import pandas as pd
import json
df = pd.DataFrame({'worker_id':['WORK_1','WORK_1','WORK_1','WORK_2','WORK_2','WORK_2'],
                    'gender':['M','M','M','F','F','F'],
                    'employer_id':['EMPL_2','EMPL_1','EMPL_1','EMPL_3','EMPL_3','EMPL_3'],
                    'year':[1990,1991,1993,1995,1992,1994],
                    'job_type':['Att','Mis','Att','Att','Mis','Att']})

# One row per unique (worker_id, gender) pair
df_G = df[['worker_id','gender']].drop_duplicates()
all_dicts = []
for indx, vals in df_G.iterrows():
    this_dict = vals.to_dict()
    # Group this worker's rows by job_type and turn each group
    # into a list of {employer_id, year} records
    job_dict = (df[df.worker_id == vals['worker_id']]
        .groupby(['job_type']).apply(lambda x: x[['employer_id','year']]
        .to_dict('records')).to_dict())
    this_dict['job_type'] = []
    for key, val in job_dict.items():
        this_dict['job_type'].append({key: val})
    all_dicts.append(this_dict)

print(json.dumps(all_dicts, indent = 4, sort_keys = True))

Prints out:

[
    {
        "gender": "M",
        "job_type": [
            {
                "Mis": [
                    {
                        "employer_id": "EMPL_1",
                        "year": 1991
                    }
                ]
            },
            {
                "Att": [
                    {
                        "employer_id": "EMPL_2",
                        "year": 1990
                    },
                    {
                        "employer_id": "EMPL_1",
                        "year": 1993
                    }
                ]
            }
        ],
        "worker_id": "WORK_1"
    },
    {
        "gender": "F",
        "job_type": [
            {
                "Mis": [
                    {
                        "employer_id": "EMPL_3",
                        "year": 1992
                    }
                ]
            },
            {
                "Att": [
                    {
                        "employer_id": "EMPL_3",
                        "year": 1995
                    },
                    {
                        "employer_id": "EMPL_3",
                        "year": 1994
                    }
                ]
            }
        ],
        "worker_id": "WORK_2"
    }
]
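
For comparison, roughly the same structure can also be built with two nested groupby calls instead of iterrows; an untested sketch along these lines:

records = []
# Outer grouping: one record per (worker_id, gender) pair
for (worker, gender), grp in df.groupby(['worker_id', 'gender'], sort=False):
    # Inner grouping: one {job_type: [contracts]} entry per job_type
    job_types = [
        {jt: sub[['employer_id', 'year']].to_dict('records')}
        for jt, sub in grp.groupby('job_type')
    ]
    records.append({'worker_id': worker, 'gender': gender, 'job_type': job_types})

print(json.dumps(records, indent=4, sort_keys=True))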

The explicit loop is perhaps not the most efficient or pythonic approach, but it works. And, if I remember pymongo correctly, you can pass the resulting list of dicts straight to an insert.
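
If it helps, the final load could look something like this (a minimal sketch, assuming a local MongoDB instance; the database and collection names are placeholders):

from pymongo import MongoClient

# Placeholder connection string, database and collection names
client = MongoClient('mongodb://localhost:27017/')
collection = client['work_db']['workers']

# insert_many takes the list of dicts built above and inserts one document per worker
collection.insert_many(all_dicts)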

Upvotes: 1
