Reputation: 43
I'm very new to Python (< 2 weeks), and was asked to read 200k+ JSON files I was provided (as-is) into a single database (using Python). These JSON files have flat, one-level attributes which vary from file to file between 50 and 1,000+, but the 50 are a subset of the 1,000.
Here is a snippet of a JSON file:
{
"study_type" : "Observational",
"intervention.intervention_type" : "Device",
"primary_outcome.time_frame" : "24 months",
"primary_completion_date.type" : "Actual",
"design_info.primary_purpose" : "Diagnostic",
"design_info.secondary_purpose" : "Intervention",
"start_date" : "January 2014",
"end_date" : "March 2014",
"overall_status" : "Completed",
"location_countries.country" : "United States",
"location.facility.name" : "Generic Institution",
}
Our goal is to take this master database of JSON files, clean up the individual columns, run descriptive statistics on those columns, and create a final, cleaned-up database.
I'm coming from a SAS background, so my thought was to use pandas and create a (very) large DataFrame. I've been combing through Stack Overflow this past week and have leveraged some learnings, but I feel there has to be a way to make this much more efficient.
Below is the code I have written so far. It runs, but very slowly (I estimate it will take days, if not weeks, to run even after eliminating the unneeded input attributes/columns starting with "result").
In addition, the awkward way I convert the dictionary to a final table leaves the column index numbers above the column names, which I haven't been able to figure out how to remove.
import json, os
import pandas as pd
from copy import deepcopy

path_to_json = '/home/ubuntu/json_flat/'

#Gets list of files in directory with *.json suffix
list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#Initialize list of Series
df_list = []

#For every json file found
for js in list_files:
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.loads(data_file.read()) #Loads JSON file into dictionary
        data_file.close() #Close data file / remove from memory
        data_copy = deepcopy(data) #Copies dictionary
        for k in data_copy.keys(): #Iterate over copied dictionary
            if k.startswith('result'): #If field starts with "result" then delete from original dictionary
                del data[k]
        df = pd.Series(data) #Convert dictionary to Series
        df_list.append(df) #Append to list of Series
        database = pd.concat(df_list, axis=1).reset_index() #Concatenate series into database

output_db = database.transpose() #Transpose rows/columns
output_db.to_csv('/home/ubuntu/output/output_db.csv', mode = 'w', index=False)
Any thoughts or advice are greatly appreciated. I am completely open to using a different technique or approach entirely (in Python) if it's more efficient and still allows us to meet the objectives above.
Thanks!
Upvotes: 4
Views: 2398
Reputation: 4862
I've tried to replicate your approach in a more concise manner, with fewer copies and appends. It works with the example data you supplied, but I don't know if there are further intricacies in your data set. You can give this a try; I hope the comments help.
import json
import os
import pandas

path_to_json = "XXX"

list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#set up an empty dictionary
resultdict = {}

for fili in list_files:
    #the with avoids the extra step of closing the file
    with open(os.path.join(path_to_json, fili), "r") as inputjson:
        #the dictionary key is set to filename here, but you could also use e.g. a counter
        resultdict[fili] = json.load(inputjson)
        """
        you can exclude stuff here or later via dictionary comprehensions:
        http://stackoverflow.com/questions/1747817/create-a-dictionary-with-list-comprehension-in-python
        e.g. as in your example code
        resultdict[fili] = {k:v for k,v in json.load(inputjson).items() if not k.startswith("result")}
        """

#put the whole thing into the DataFrame
dataframe = pandas.DataFrame(resultdict)

#write out, transpose for desired format
with open("output.csv", "w") as csvout:
    dataframe.T.to_csv(csvout)
Upvotes: 3
Reputation: 249582
Your most critical performance bug is probably this:
database = pd.concat(df_list, axis=1).reset_index()
You do this inside the loop, every time adding one more item to df_list and then concatenating again. But this "database" variable is not used until the end, so you can do this step just once, outside the loop.
With Pandas, "concat" in a loop is a huge anti-pattern. Build your list in the loop, concat once.
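For illustration, a minimal sketch of that restructuring, reusing the path and names from the question (the per-file deepcopy is dropped as well; it isn't needed):

import json, os
import pandas as pd

path_to_json = '/home/ubuntu/json_flat/'
list_files = [js for js in os.listdir(path_to_json) if js.endswith('.json')]

df_list = []
for js in list_files:
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.load(data_file)
        #only build the list inside the loop; no concat here
        df_list.append(pd.Series({k: v for k, v in data.items() if not k.startswith('result')}))

#concatenate exactly once, after the loop has finished
database = pd.concat(df_list, axis=1)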
The second thing is that you should use Pandas to read the JSON files also: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Keep it simple. Write a function that takes a path, calls pd.read_json(), removes rows you don't need (series.str.startswith()), etc.
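A sketch of what that might look like (the typ='series' argument reads a flat JSON object into a Series; filtering on the index is an assumption based on the question's "result" prefix):

import os
import pandas as pd

def load_one(path):
    #read one flat JSON object straight into a Series
    s = pd.read_json(path, typ='series')
    #drop the rows whose names start with "result"
    return s[~s.index.str.startswith('result')]

path_to_json = '/home/ubuntu/json_flat/'  #directory from the question
df_list = [load_one(os.path.join(path_to_json, js))
           for js in os.listdir(path_to_json) if js.endswith('.json')]
database = pd.concat(df_list, axis=1).T  #single concat; one row per file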
Once you have that working well, your next step will be to check whether you are CPU limited (CPU usage 100%) or I/O limited (CPU usage much less than 100%).
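For example, a quick way to check this from inside Python (a sketch; run_conversion is a placeholder for whatever does the actual work):

import time

cpu0, wall0 = time.process_time(), time.perf_counter()
run_conversion()  #placeholder for the real workload
cpu = time.process_time() - cpu0
wall = time.perf_counter() - wall0

#CPU time close to wall time  -> CPU limited
#CPU time far below wall time -> I/O limited
print("CPU: %.1fs, wall: %.1fs" % (cpu, wall))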
Upvotes: 1