Reputation: 43
I'm very new to Python (< 2 weeks), and was asked to read 200k+ JSON files I was provided (as-is) into a single database (using Python). These JSON files have flat, one-level attributes which vary from file to file between 50 and 1,000+, but the 50 are a subset of the 1,000.
Here is a snippet of a JSON file:
{
"study_type" : "Observational",
"intervention.intervention_type" : "Device",
"primary_outcome.time_frame" : "24 months",
"primary_completion_date.type" : "Actual",
"design_info.primary_purpose" : "Diagnostic",
"design_info.secondary_purpose" : "Intervention",
"start_date" : "January 2014",
"end_date" : "March 2014",
"overall_status" : "Completed",
"location_countries.country" : "United States",
"location.facility.name" : "Generic Institution",
}
Our goal is to take this master database of JSON files, clean up the individual columns, run descriptive statistics on those columns, and create a final, cleaned-up database.
I'm coming from a SAS background, so my thought was to use pandas and create a (very) large DataFrame. I've been combing through Stack Overflow this past week and have leveraged some learnings, but I feel there has to be a way to make this much more efficient.
Below is the code I have written so far. It runs, but very slowly (I estimate it will take days, if not weeks, to run even after eliminating the unneeded input attributes/columns starting with "result").
In addition, the awkward way I convert the dictionary to a final table leaves the column index numbers above the column names, which I haven't been able to figure out how to remove.
import json, os
import pandas as pd
from copy import deepcopy

path_to_json = '/home/ubuntu/json_flat/'

#Gets list of files in directory with *.json suffix
list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#Initialize list of Series
df_list = []

#For every json file found
for js in list_files:
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.loads(data_file.read()) #Loads JSON file into dictionary
        data_file.close() #Close data file / remove from memory
        data_copy = deepcopy(data) #Copies dictionary
        for k in data_copy.keys(): #Iterate over copied dictionary
            if k.startswith('result'): #If field starts with "result" then delete from original dictionary
                del data[k]
        df = pd.Series(data) #Convert dictionary to Series
        df_list.append(df) #Append to list of Series
        database = pd.concat(df_list, axis=1).reset_index() #Concatenate series into database

output_db = database.transpose() #Transpose rows/columns
output_db.to_csv('/home/ubuntu/output/output_db.csv', mode = 'w', index=False)
Any thoughts or advice are greatly appreciated. I am completely open to using a different technique or approach entirely (in Python) if it's more efficient and still allows us to meet the objectives above.
Thanks!
Upvotes: 4
Views: 2398
Reputation: 4862
I've tried to replicate your approach in a more concise manner, with fewer copies and appends. It works with the example data you supplied, but I don't know if there are further intricacies in your data set. You can give this a try; I hope the comments help.
import json
import os
import pandas

path_to_json = "XXX"

list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#set up an empty dictionary
resultdict = {}

for fili in list_files:
    #the with avoids the extra step of closing the file
    with open(os.path.join(path_to_json, fili), "r") as inputjson:
        #the dictionary key is set to filename here, but you could also use e.g. a counter
        resultdict[fili] = json.load(inputjson)
        """
        you can exclude stuff here or later via dictionary comprehensions:
        http://stackoverflow.com/questions/1747817/create-a-dictionary-with-list-comprehension-in-python
        e.g. as in your example code
        resultdict[fili] = {k:v for k,v in json.load(inputjson).items() if not k.startswith("result")}
        """

#put the whole thing into the DataFrame
dataframe = pandas.DataFrame(resultdict)

#write out, transpose for desired format
with open("output.csv", "w") as csvout:
    dataframe.T.to_csv(csvout)
Upvotes: 3
Reputation: 249582
Your most critical performance bug is probably this:
database = pd.concat(df_list, axis=1).reset_index()
You do this inside the loop, every time adding one more item to df_list and then concatenating again. But this "database" variable is not used until the end, so you can do this step just once, outside the loop.
With Pandas, "concat" in a loop is a huge anti-pattern. Build your list in the loop, concat once.
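For illustration, a minimal sketch of that restructuring, reusing the path and names from the question (the per-file deepcopy is dropped as well; it isn't needed):

import json, os
import pandas as pd

path_to_json = '/home/ubuntu/json_flat/'
list_files = [js for js in os.listdir(path_to_json) if js.endswith('.json')]

df_list = []
for js in list_files:
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.load(data_file)
        #only build the list inside the loop; no concat here
        df_list.append(pd.Series({k: v for k, v in data.items() if not k.startswith('result')}))

#concatenate exactly once, after the loop has finished
database = pd.concat(df_list, axis=1)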
The second thing is that you should use Pandas to read the JSON files also: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Keep it simple. Write a function that takes a path, calls pd.read_json(), removes rows you don't need (series.str.startswith()), etc.
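A sketch of what that might look like (the typ='series' argument reads a flat JSON object into a Series; filtering on the index is an assumption based on the question's "result" prefix):

import os
import pandas as pd

def load_one(path):
    #read one flat JSON object straight into a Series
    s = pd.read_json(path, typ='series')
    #drop the rows whose names start with "result"
    return s[~s.index.str.startswith('result')]

path_to_json = '/home/ubuntu/json_flat/'  #directory from the question
df_list = [load_one(os.path.join(path_to_json, js))
           for js in os.listdir(path_to_json) if js.endswith('.json')]
database = pd.concat(df_list, axis=1).T  #single concat; one row per file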
Once you have that working well, your next step will be to check whether you are CPU limited (CPU usage 100%) or I/O limited (CPU usage much less than 100%).
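For example, a quick way to check this from inside Python (a sketch; run_conversion is a placeholder for whatever does the actual work):

import time

cpu0, wall0 = time.process_time(), time.perf_counter()
run_conversion()  #placeholder for the real workload
cpu = time.process_time() - cpu0
wall = time.perf_counter() - wall0

#CPU time close to wall time  -> CPU limited
#CPU time far below wall time -> I/O limited
print("CPU: %.1fs, wall: %.1fs" % (cpu, wall))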
Upvotes: 1