Reputation: 3591
I have website logs saved as JSON, and I want to load them into pandas. The JSON has multiple levels of nesting, like this:
{
  "settings": {"siteIdentifier": "site1"},
  "event": {"name": "pageview", "properties": []},
  "context": {
    "date": "Thu Dec 01 2016 01:00:08 GMT+0100 (CET)",
    "location": {"hash": "", "host": "aaa"},
    "screen": {"availHeight": 876,
               "orientation": {"angle": 0, "type": "landscape-primary"}},
    "navigator": {"appCodeName": "Mozilla", "vendorSub": ""},
    "visitor": {"id": "unique_id"}
  },
  "server": {"HTTP_COOKIE": "uid", "date": "2016-12-01T00:00:09+00:00"}
}
{
  "settings": {"siteIdentifier": "site2"},
  "event": {"name": "pageview", "properties": []},
  "context": {
    "date": "Thu Dec 01 2016 01:00:10 GMT+0100 (CET)",
    "location": {"hash": "", "host": "aaa"},
    "screen": {"availHeight": 852,
               "orientation": {"angle": 90, "type": "landscape-primary"}},
    "navigator": {"appCodeName": "Mozilla", "vendorSub": ""},
    "visitor": {"id": "unique_id"}
  },
  "server": {"HTTP_COOKIE": "uid", "date": "2016-12-01T00:00:09+00:10"}
}
The only working solution I have so far is:
import pandas as pd
import json
from pandas.io.json import json_normalize

pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 30)

first = True
filename = "/path/to/file.json"
with open(filename, 'r') as f:
    for line in f:  # read line by line to retrieve one JSON object at a time
        data = json.loads(line)  # parse the single JSON object on this line
        if first:  # initialize the dataframe
            df = json_normalize(data)  # normalize to flatten the nested data
            first = False
        else:  # add a row for each JSON object
            df = df.append(json_normalize(data))
df.to_csv("2016-12-02.csv", index=False, encoding='utf-8')
I have to read line by line because my JSON objects are simply concatenated one after the other, not wrapped in a list. My code works, but it is very, very slow. What can I do to speed it up? I use pandas because it seems appropriate, but another approach is fine too.
Upvotes: 0
Views: 178
Reputation: 36033
You can parse all the JSON objects into a single list first and call json_normalize once on the whole list, instead of appending to the DataFrame row by row:
with open(filename, 'r') as f:
    data = [json.loads(line) for line in f]

df = json_normalize(data)
df.to_csv("2016-12-02.csv", index=False, encoding='utf-8')
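For reference, here is the same approach as a self-contained sketch with inline sample records instead of a file (it uses `pd.json_normalize`, the current name of the function; in older pandas versions it lived at `pandas.io.json.json_normalize`):

```python
import json
import pandas as pd

# Two newline-delimited JSON records, as they would appear in the log file
lines = [
    '{"settings": {"siteIdentifier": "site1"}, "event": {"name": "pageview"}}',
    '{"settings": {"siteIdentifier": "site2"}, "event": {"name": "pageview"}}',
]

# Parse every line first, then normalize the whole list in one call
data = [json.loads(line) for line in lines]
df = pd.json_normalize(data)

# Nested keys become dotted column names, e.g. "settings.siteIdentifier"
print(df.columns.tolist())
print(len(df))  # one row per JSON object
```

This is much faster than the append loop because `append` copies the entire DataFrame on every call, making the loop quadratic in the number of rows, while a single `json_normalize` over the list builds the frame once.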
Upvotes: 2