Reputation: 329
I want to import a JSON file (even though it looks more like a plain text file) where each line is a small JSON object with four key-value pairs. Each JSON object should become one row in a pandas DataFrame with four columns.
Example:
# Inside "data.json"
{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}
{"time": "2020-07-01:14:27:16.0000", "id": "m38dk118", "position": "3277", "active_current": "0.0"}
...
{"time": "2020-07-30:14:27:16.0000", "id": "m38dk006", "position": "73117", "active_current": "0.0"}
data.json is approximately 30 MB and contains roughly 250,000 lines for a single day
data.json is approximately 900 MB and contains roughly 7.5 million lines for a whole month
The following code snippet does the job but is far too slow. Alternatives to pandas are welcome too; I am not limited to pandas, but I am inexperienced with handling large amounts of log data.
My attempt:
import pandas as pd
import json

df = pd.DataFrame()
with open('data.json', 'r') as stacked_json_file:
    row_idx = -1
    for json_file in stacked_json_file:
        row_idx += 1
        # Build a one-row DataFrame per line and append it to the result
        df = df.append(pd.DataFrame(json.loads(json_file), index=[row_idx]))
Is this slow because pd.DataFrame.append does not append in place?
Upvotes: 1
Views: 391
Reputation: 5036
This should work
pd.read_json('data.json', lines=True)
Out:
time id position active_current
0 2020-07-01:14:27:16.0000 m38dk117 66277 17.1
1 2020-07-01:14:27:16.0000 m38dk118 3277 0.0
2 2020-07-30:14:27:16.0000 m38dk006 73117 0.0
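If the whole month (~900 MB) does not fit comfortably into memory at once, read_json with lines=True can also return the data in chunks; a minimal sketch, assuming the same data.json layout and an arbitrary chunk size of 500,000 lines:
import pandas as pd

# Read the line-delimited JSON in chunks (the chunk size is an assumption,
# tune it to the available memory), then concatenate the pieces.
chunks = pd.read_json('data.json', lines=True, chunksize=500_000)
df = pd.concat(chunks, ignore_index=True)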
If this doesn't work:
If your data is not valid JSON but rather a text file with a JSON object on every line, you can create a list of dicts first to avoid appending to the DataFrame row by row:
import json
import pandas as pd

with open('data.json', 'r') as f:
    dictlist = [json.loads(x) for x in f]
pd.DataFrame(dictlist)
Out:
time id position active_current
0 2020-07-01:14:27:16.0000 m38dk117 66277 17.1
1 2020-07-01:14:27:16.0000 m38dk118 3277 0.0
2 2020-07-30:14:27:16.0000 m38dk006 73117 0.0
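Note that json.loads keeps the values as strings, exactly as they are quoted in the file, so the resulting columns have object dtype. If you need numeric columns, a small follow-up sketch (column names taken from the example data):
# Convert the string columns to numbers; errors='coerce' turns anything
# unparsable into NaN instead of raising (an assumption about what you want).
df = pd.DataFrame(dictlist)
df['position'] = pd.to_numeric(df['position'], errors='coerce')
df['active_current'] = pd.to_numeric(df['active_current'], errors='coerce')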
Comparing iterative append, creating a list of dictionaries, and pd.read_json:
[Benchmark result plot]
Code used for the benchmark:
import pandas as pd
import json
import io

def makedata(n):
    t = '''{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}
{"time": "2020-07-01:14:27:16.0000", "id": "m38dk118", "position": "3277", "active_current": "0.0"}
{"time": "2020-07-30:14:27:16.0000", "id": "m38dk006", "position": "73117", "active_current": "0.0"}
''' * n
    return t

def pdjson(file):
    return pd.read_json(io.StringIO(file), lines=True)

def dictlist(file):
    with io.StringIO(file) as f:
        l = [json.loads(x) for x in f]
    return pd.DataFrame(l)

def appenddf(file):
    df = pd.DataFrame()
    with io.StringIO(file) as stacked_json_file:
        row_idx = -1
        for json_file in stacked_json_file:
            row_idx += 1
            df = df.append(pd.DataFrame(json.loads(json_file), index=[row_idx]))
    return df
import perfplot

perfplot.show(
    setup=makedata,
    kernels=[appenddf, dictlist, pdjson],
    n_range=[2**k for k in range(5, 16)],
    equality_check=None,
    xlabel='len(df)'
);
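As for the question above: df.append is not in place. It returns a brand-new DataFrame and copies all previously accumulated rows on every call, so the loop ends up doing quadratic work, which is why the appenddf variant is expected to fall further behind the other two as the input grows.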
Upvotes: 4