David

Reputation: 1202

Pandas: select which rows to read from a JSON file

I have a huge JSON file (many smaller .log files in JSON Lines format combined into a total of 8 GB), composed of many different objects, one object per line. I want to read this file into a pandas DataFrame, but I am only interested in the entries for one specific object (which would drastically reduce the amount of data to read). Can this filtering be done with pandas or Python before reading into a DataFrame?

My current code is as follows:

import pandas as pd
import glob

df = pd.concat(
    [pd.read_json(f, encoding="ISO-8859-1", lines=True)
     for f in glob.glob("logs/sample1/*.log")],
    ignore_index=True,
)

As you might imagine, this is very computationally heavy and takes a long time to complete. Is there a way to filter the data before reading it into a DataFrame?

Sample of Data:

{"Name": "1","variable": "value","X": {"nested_var": 5000,"nested_var2": 2000}}
{"Name": "2","variable": "value","X": {"nested_var": 1222,"nested_var2": 8465}}
{"Name": "2","variable": "value","X": {"nested_var": 123,"nested_var2": 865}}
{"Name": "1","variable": "value","X": {"nested_var": 5500,"nested_var2": 2070}}
{"Name": "2","variable": "value","X": {"nested_var": 985,"nested_var2": 85}}
{"Name": "2","variable": "value","X": {"nested_var": 45,"nested_var2": 77}}

I want to read only the entries where Name is "1".

Upvotes: 1

Views: 2247

Answers (1)

jezrael

Reputation: 863301

You can loop over each file and each line, append the filtered rows to a list, and finally pass the list to the DataFrame constructor:

import glob
import json

import pandas as pd

data = []
for file in glob.glob('logs/sample1/*.log'):
    with open(file, encoding='ISO-8859-1') as f:
        for line in f:
            record = json.loads(line)  # parse each line once
            if record['Name'] == '1':  # keep only entries for Name == "1"
                data.append(record)

df = pd.DataFrame(data)
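
If the nested X object should end up as separate columns rather than a single dict-valued column, the filtered records can also be flattened with pandas.json_normalize instead of the plain DataFrame constructor. This is only a sketch reusing the data list built above, and assumes a pandas version where json_normalize is available as a top-level function (1.0+):

# flatten nested dicts: the "X" object becomes columns "X.nested_var" and "X.nested_var2"
df = pd.json_normalize(data)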

Upvotes: 1
