Thales Vilela

Reputation: 41

Construct DataFrame from multiple JSON files

I'm using pandas to convert multiple JSON files into a DataFrame. I only want the entries that match certain criteria from those files, but right now I append each whole converted file and then filter the result.

Suppose I have 2 json files that look like this:

File 1500.json

[
  {
    "CodStore": 1500,
    "CodItem": 10,
    "NameItem": "Burger",
    "Price": 10.0
  },
  {
    "CodStore": 1500,
    "CodItem": 20,
    "NameItem": "Fries",
    "Price": 3.0
  },
  {
    "CodStore": 1500,
    "CodItem": 30,
    "NameItem": "Ice Cream",
    "Price": 1.0
  }
]

File 1805.json

[
  {
    "CodStore": 1805,
    "CodItem": 10,
    "NameItem": "Burger",
    "Price": 9.0
  },
  {
    "CodStore": 1805,
    "CodItem": 20,
    "NameItem": "Fries",
    "Price": 2.0
  },
  {
    "CodStore": 1805,
    "CodItem": 30,
    "NameItem": "Ice Cream",
    "Price": 0.5
  }
]

I only want entries with CodItem 10 and 30 in my DataFrame, so my Python code looks like this:

from pandas import DataFrame, read_json

df = DataFrame()

stores = [1500, 1805]

for store in stores:
    filename = '%s.json' % store
    df = df.append(read_json(filename))

df = df[(df.CodItem == 10) | (df.CodItem == 30)]

This is just an example; the real problem is that I have more than 600 JSON files, so reading takes a lot of time, the DataFrame becomes very long, and memory consumption is very high.

Is there a way to read only the entries that match the criteria into the DataFrame?

Upvotes: 1

Views: 2114

Answers (2)

cs95

Reputation: 402333

One option would be to append your JSON data to a list, then convert once at the end and filter.

import json
import pandas as pd

coditems = [10, 30]

data = []
for filename in json_files:
    # accumulate the raw records from every file in a single flat list
    with open(filename) as f:
        data.extend(json.load(f))

df = pd.DataFrame(data).query('CodItem in @coditems')

This should be a lot faster, because calling append repeatedly in a loop is a quadratic operation: each call copies the entire accumulated DataFrame. You have to read all the data in anyway, so you may as well let pandas convert it in one go.
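For completeness, json_files is assumed here to be the list of file paths. A minimal sketch of how it could be built, assuming the files are named after the store codes as in the question:

import glob

# assumption: files are named after the store codes, as in the question
stores = [1500, 1805]
json_files = ['%s.json' % store for store in stores]

# or pick up every JSON file in the working directory instead
# json_files = glob.glob('*.json')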


Another option would be to initialise your DataFrames inside a loop and then call pd.concat after you're done.

import pandas as pd

df_list = []
for filename in json_files:
    # read_json already returns a DataFrame for each file
    df_list.append(pd.read_json(filename))

df = pd.concat(df_list, ignore_index=True).query('CodItem in @coditems')
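Either way, pd.concat copies each per-file frame only once at the end, whereas calling append inside the loop re-copies everything already accumulated on every iteration.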

Upvotes: 2

foglerit

Reputation: 8269

You can create a temporary DataFrame within your loop and filter it before appending:

from pandas import DataFrame, read_json

df = DataFrame()

stores = [1500, 1805]

for store in stores:
    filename = '%s.json' % store
    temp_df = read_json(filename)
    df = df.append(temp_df[(temp_df.CodItem == 10) | (temp_df.CodItem == 30)])
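Because only the matching rows are kept from each file, the DataFrame being built stays small even across 600+ files, which helps with the memory consumption mentioned in the question.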

Upvotes: 0
