Reputation: 1202
I have a huge JSON file (lots of smaller .log files in JSON format combined together, 8 GB in total), composed of multiple different objects, where every object takes a row. I want to read this file into a pandas dataframe. I am only interested in collecting the JSON entries for one specific object (this would drastically reduce the amount of data to read). Can this be done with pandas or python before reading into a dataframe?
My current code is as follows:
import pandas as pd
import glob
df = pd.concat([pd.read_json(f, encoding = "ISO-8859-1", lines=True) for f in glob.glob("logs/sample1/*.log")], ignore_index=True)
As you might imagine, this is very computationally heavy and takes a lot of time to complete. Is there a way to filter the data before reading it into a dataframe?
Sample of Data:
{"Name": "1","variable": "value","X": {"nested_var": 5000,"nested_var2": 2000}}
{"Name": "2","variable": "value","X": {"nested_var": 1222,"nested_var2": 8465}}
{"Name": "2","variable": "value","X": {"nested_var": 123,"nested_var2": 865}}
{"Name": "1","variable": "value","X": {"nested_var": 5500,"nested_var2": 2070}}
{"Name": "2","variable": "value","X": {"nested_var": 985,"nested_var2": 85}}
{"Name": "2","variable": "value","X": {"nested_var": 45,"nested_var2": 77}}
I want to read only the instances where Name = "1".
Upvotes: 1
Views: 2247
Reputation: 863301
You can loop over each file and each line, append the filtered rows to a list, and finally pass the list to the DataFrame constructor:
import glob
import json

import pandas as pd

data = []
for file in glob.glob("logs/sample1/*.log"):
    with open(file, encoding="ISO-8859-1") as f:
        for line in f:
            record = json.loads(line)  # parse each line only once
            if record["Name"] == "1":
                data.append(record)

df = pd.DataFrame(data)
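Note that with this approach the nested X field ends up as a single dict column. If you want X.nested_var and X.nested_var2 as separate columns, a minimal sketch, assuming pandas >= 1.0 where json_normalize is a top-level function, would be:

# Flatten nested dicts such as "X" into dot-separated columns
# (X.nested_var, X.nested_var2) instead of one dict column
df = pd.json_normalize(data)

If parsing every line is still too slow, a cheap substring check such as '"Name": "1"' in line before calling json.loads can skip most non-matching lines, though that shortcut depends on the exact spacing of the serialized JSON.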
Upvotes: 1