epiphany
epiphany

Reputation: 766

How to join json with Pandas

The goal here is to count the number of patients for each type of diagnosis. In patient record, visit id is unique, yet in diagnosis record, since one visit could have multiple diagnosis, the same visit id could have multiple diagnosis id.

To do so, i think the 2 data frame need to be linked with the field visit id. Could anyone kindly shed some light on how to link the 2 json via Pandas and count the number of patients for each diagnosis. Many thanks

Patient record

enter image description here

JSON [Patient record]

[
 {
   "Doctor id":"AU1254",
   "Patient":[
      {
         "Patient id":"BK1221",
         "Patient name":"Tim"
      }
   ],  
   "Visit id":"B0001"       
},
 {
   "Doctor id":"AU8766",
   "Patient":[
      {
         "Patient id":"BK1209",
         "Patient name":"Sue"
      }
   ],  
   "Visit id":"B0002"  
},
 {
   "Doctor id":"AU1254",
   "Patient":[
      {
         "Patient id":"BK1323",
         "Patient name":"Sary"
      }
   ],  
   "Visit id":"B0003"  
  }
]

Diagnosis record

enter image description here

JSON [Diagnosis record]

[
   {
      "Visit id":"B0001",
      "Diagnosis":[
         {
            "diagnosis id":"D1001",
            "diagnosis name":"fever"           
         },
         {
            "diagnosis id":"D1987",
            "diagnosis name":"cough"
         },
         {
             "diagnosis id":"D1265",
            "diagnosis name":"running nose"
         }
      ]
   }, 
      {
      "Visit id":"B0002",
      "Diagnosis":[
         {
            "diagnosis id":"D1987",
            "diagnosis name":"cough"           
         },
         {
            "diagnosis id":"D1453",
            "diagnosis name":"stomach ache"
         }
      ]
   } 
]

Upvotes: 1

Views: 384

Answers (3)

Danail Petrov
Danail Petrov

Reputation: 1875

You can use left merge() on visit id. merge

>  from pandas.io.json import json_normalize
>  import json
>  json1 = <your first json here>
>  json2 = <your second json here>
>  df1=pd.json_normalize(data=json.loads(json1), record_path='Patient', meta=['Doctor id','Visit id'])
>  df2=pd.json_normalize(data=json.loads(json2), record_path='Diagnosis', meta=['Visit id'])


>  print(df1.merge(df2, on='Visit id', how='left').dropna())
  Patient id Patient name Doctor id Visit id diagnosis id diagnosis name
0     BK1221          Tim    AU1254    B0001        D1001          fever
1     BK1221          Tim    AU1254    B0001        D1987          cough
2     BK1221          Tim    AU1254    B0001        D1265   running nose
3     BK1209          Sue    AU8766    B0002        D1987          cough
4     BK1209          Sue    AU8766    B0002        D1453   stomach ache

You can also do some fancy grouping / printing

pd.pivot_table(df3, index=['Patient id','Patient name','Doctor id','Visit id'], values=['diagnosis id','diagnosis name'], aggfunc=list)
                                                     diagnosis id                diagnosis name
Patient id Patient name Doctor id Visit id
BK1209     Sue          AU8766    B0002            [D1987, D1453]         [cough, stomach ache]
BK1221     Tim          AU1254    B0001     [D1001, D1987, D1265]  [fever, cough, running nose]

And count per diagnosis / per patient

df3.groupby(['diagnosis id', 'diagnosis name']).agg({'Patient name': [list, 'count']})
                            Patient name
                                    list count
diagnosis id diagnosis name
D1001        fever                 [Tim]     1
D1265        running nose          [Tim]     1
D1453        stomach ache          [Sue]     1
D1987        cough            [Tim, Sue]     2

Upvotes: 4

Pygirl
Pygirl

Reputation: 13349

Try: (x--> JSON [Patient record], y-->JSON [Diagnosis record]

df = pd.DataFrame(x)
df = pd.concat([df.pop('Patient').apply(lambda x: pd.Series(x[0])), df], axis=1)

df1 = pd.DataFrame(y)

df1 = pd.concat([df1.explode('Diagnosis')['Diagnosis'].apply(pd.Series), df1], axis=1)
df1.pop('Diagnosis')

df_merge = pd.merge(df,df1, on='Visit id', how='right')

df_merge:

    Patient id  Patient name  Doctor id Visit id  diagnosis id diagnosis name
0   BK1221      Tim           AU1254    B0001     D1001        fever
1   BK1221      Tim           AU1254    B0001     D1987        cough
2   BK1221      Tim           AU1254    B0001     D1265        running nose
3   BK1209      Sue           AU8766    B0002     D1987        cough
4   BK1209      Sue           AU8766    B0002     D1453        stomach ache

To count:

df_merge.groupby('diagnosis name')['Patient id'].count()

Edit:

Try:

df_merge.groupby('diagnosis name').agg({'Patient name': [list, 'count']}).reset_index()

diagnosis name  Patient name
                list        count
        cough   [Tim, Sue]  2
        fever   [Tim]       1
running nose    [Tim]       1
stomach ache    [Sue]       1

Upvotes: 1

smaxwell
smaxwell

Reputation: 89

Try the following for the patient records.

patients_df = pd.read_json(patients.json)

patient_id = []
patient_name =[]

# Get attributes from nested nested datatypes in Patient column
for patient in patients_df['Patients']:
    patient_id = patient[0]['Patient id']
    patient_name = patient[0]['Patient name']

# Add to the pandas dataframe
patients_df['Patient name'] = patient_name
patient_df['Patient id'] = patient_id

# Drop the 'Patient' column
patients_df = patients_df.drop(column='Patient')

Upvotes: 0

Related Questions