Optimize parsing file with JSON objects in pandas dataframe, where keys may be missing in some rows

Question

I'm looking to optimize the code below which takes ~5 seconds, which is too slow for a file of only 1000 lines.

I have a large file where each line contains valid JSON, with each JSON looking like the following (the actual data is much larger and nested, so I use this JSON snippet for illustration):

  {"location":{"town":"Rome","groupe":"Advanced",
    "school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
    "id":"145",
    "Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
    "Father":{"FatherName":"Peter","FatherAge":"51"},
    "Teacher":["MrCrock","MrDaniel"],"Field":"Marketing",
     "season":["summer","spring"]}

I need to parse this file in order to extract only some key-values from every JSON, to obtain the resulting dataframe:

Groupe      Id   MotherName   FatherName
Advanced    56   Laure         James
Middle      11   Ann           Nicolas
Advanced    6    Helen         Franc

But some keys I need in the dataframe, are missing in some JSON objects, so I should to verify if the key is present, and if not, fill the corresponding value with Null. I use with the following method:

df = pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open (path/to/file) as f:
    for chunk in f:
        jfile = json.loads(chunk)

        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe=np.nan

        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan

        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan

        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else: 
            FatherName = np.nan

        df = df.append({"groupe":group, "id":id, "MotherName":MotherName, "FatherName":FatherName},
            ignore_index=True)

I need to optimize the runtime over the whole 1000-row file to <= 2 seconds. In PERL the same parsing function takes < 1 second, but I need to implement it in Python.

tdelaney · Accepted Answer

You'll get the best performance if you can build the dataframe in a single step during initialization. DataFrame.from_record takes a sequence of tuples which you can supply from a generator that reads one record at a time. You can parse the data faster with get, which will supply a default parameter when the item isn't found. I created an empty dict called dummy to pass for intermediate gets so that you know a chained get will work.

I created a 1000 record dataset and on my crappy laptop the time went from 18 seconds to .06 seconds. Thats pretty good.

import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """ convert 1 json dict to records for import"""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan), 
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
df = pd.DataFrame.from_records(map(extract_data, open('file.json')),
    columns=['group', 'id', 'Father', 'Mother'])
print('New algorithm', time.time()-start)

#
# The original way
#

start= time.time()
df=pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
with open ('file.json') as f:
      for chunk in f:
           jfile=json.loads(chunk)
           if 'groupe' in jfile['location']:
               groupe=jfile['location']['groupe']
           else:
               groupe=np.nan
           if 'id' in jfile:
                id=jfile['id']
           else:
                id=np.nan
           if 'MotherName' in jfile['Mother']:
                MotherName=jfile['Mother']['MotherName']
           else:
                MotherName=np.nan
           if 'FatherName' in jfile['Father']:
                FatherName=jfile['Father']['FatherName']
           else: 
                FatherName=np.nan
           df = df.append({"groupe":groupe,"id":id,"MotherName":MotherName,"FatherName":FatherName},
            ignore_index=True)
print('original', time.time()-start)

Optimize parsing file with JSON objects in pandas dataframe, where keys may be missing in some rows

Answers (2)

Related Questions