Simon B
Simon B

Reputation: 329

Pandas Dataframe from stacked json files - Speed issue

I want to import a json-file (even though it looks more like a plain txt-file) where each line is a small (four data pairs) json file. Every json file should be a row in the pandas Dataframe with four columns.

Example:

# Inside "data.json"

{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}
{"time": "2020-07-01:14:27:16.0000", "id": "m38dk118", "position": "3277", "active_current": "0.0"}
...
{"time": "2020-07-30:14:27:16.0000", "id": "m38dk006", "position": "73117", "active_current": "0.0"}

data.json is approx 30MB large and contains roughly 250.000 lines - for each day

data.json is approx 900MB large and contains roughly 7.5M lines - for each month

The following code snippet does do the job but is far too slow. Alternative options to pandas are welcome too, i am not limited to pandas. But inexperienced dealing with huge amounts of log-data.

Proposition:

import pandas as pd
import json

df = pd.DataFrame()
with open('data.json', 'r') as stacked_json_file:
    row_idx = -1

    for json_file in stacked_json_file:
        row_idx += 1
        
        df = df.append(pd.DataFrame(json.loads(json_file), index = [row_idx]))

Is this possibly slow because pd.DataFrame.append does not append IN-Place?

Upvotes: 1

Views: 391

Answers (1)

Michael Szczesny
Michael Szczesny

Reputation: 5036

This should work

pd.read_json('data.json', lines=True)

Out:

                       time        id  position  active_current
0  2020-07-01:14:27:16.0000  m38dk117     66277            17.1
1  2020-07-01:14:27:16.0000  m38dk118      3277             0.0
2  2020-07-30:14:27:16.0000  m38dk006     73117             0.0

If this doesn't work:

Assuming your data is not valid json, but a text file with a json object in every line. You can create a list of dicts first to avoid appending to the dataframe

with open('data.json', 'r') as f:
  dictlist = [json.loads(x) for x in f]
pd.DataFrame(dictlist)

Out:

                       time        id position active_current
0  2020-07-01:14:27:16.0000  m38dk117    66277           17.1
1  2020-07-01:14:27:16.0000  m38dk118     3277            0.0
2  2020-07-30:14:27:16.0000  m38dk006    73117            0.0

Micro-Benchmark

Comparing iterative append, creating a list of dictionaries and pd.read_json.

Result

benchmark results

Code used for the benchmark

import pandas as pd
import json
import io

def makedata(n):
  t = '''{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}
{"time": "2020-07-01:14:27:16.0000", "id": "m38dk118", "position": "3277", "active_current": "0.0"}
{"time": "2020-07-30:14:27:16.0000", "id": "m38dk006", "position": "73117", "active_current": "0.0"}
''' * n
  return t

def pdjson(file):
  return pd.read_json(io.StringIO(file), lines=True)

def dictlist(file):
  with io.StringIO(file) as f:
    l = [json.loads(x) for x in f]
  return pd.DataFrame(l)

def appenddf(file):
  df = pd.DataFrame()
  with io.StringIO(file) as stacked_json_file:
      row_idx = -1

      for json_file in stacked_json_file:
          row_idx += 1
          
          df = df.append(pd.DataFrame(json.loads(json_file), index = [row_idx]))
  return df

import perfplot
perfplot.show(
    setup = makedata,
    kernels = [appenddf, dictlist, pdjson],
    n_range= [2**k for k in range(5,16)],
    equality_check=None,
    xlabel='len(df)'
);

Upvotes: 4

Related Questions