wallisers

Reputation: 388

Concatenating huge dataframes with pandas

I have sensor data recorded over a timespan of one year. The data is stored in twelve chunks, each with 1000 columns and ~1,000,000 rows. I have written a script to concatenate these chunks into one large file, but about halfway through the execution I get a MemoryError. (I am running this on a machine with ~70 GB of usable RAM.)

import gc
from os import listdir
import pandas as pd

path = "/slices02/hdf/"
slices = listdir(path)
res = pd.DataFrame()

for sl in slices:
    temp = pd.read_hdf(path + sl)
    res = pd.concat([res, temp], sort=False, axis=1)
    del temp      # drop the reference to the slice just appended
    gc.collect()  # force collection before reading the next slice
res.fillna(method="ffill", inplace=True)
res.to_hdf(path + "sensor_data_cpl.hdf", "online", mode="w")

I have also tried to fiddle with HDFStore so I do not have to load all the data into memory (see Merging two tables with millions of rows in Python), but I could not figure out how that works in my case.

Upvotes: 0

Views: 752

Answers (2)

Luis Blanche

Reputation: 627

When you read a CSV into a pandas DataFrame, the process can take up to twice the memory the final frame needs (because of type guessing and all the automatic conversions pandas performs).

Several methods to fight that:

  1. Use chunks. I see that your data is already in chunks, but maybe those are too big, so you can read each file in chunks using the chunksize parameter of pandas.read_hdf or pandas.read_csv (see the first sketch after this list).

  2. Provide dtypes to avoid type guessing and mixed types (e.g. a column of strings with null values will have a mixed type); this works together with the low_memory parameter of pandas.read_csv (see the second sketch below).
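For point 1, a minimal sketch of a chunked read. The file name and the key "online" are assumptions about how your slices were written, and chunked reads only work on HDF files saved with format="table":

import pandas as pd

# Process ~100k rows at a time instead of loading a whole slice.
for chunk in pd.read_hdf("/slices02/hdf/slice_01.hdf", key="online",
                         chunksize=100_000):
    print(chunk.shape)  # placeholder for your per-chunk logic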
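For point 2, a sketch of explicit dtypes with pandas.read_csv; the column names here are made up, so adapt the mapping to your real schema. Declaring float32 also halves the memory of the float64 pandas would otherwise guess:

import numpy as np
import pandas as pd

# Hypothetical sensor columns: one float32 entry per column.
dtypes = {f"sensor_{i}": np.float32 for i in range(1000)}
df = pd.read_csv("slice_01.csv", dtype=dtypes, low_memory=False)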

If that is not sufficient, you'll have to turn to distributed technologies like pyspark, dask, modin or even pandarallel; a minimal dask sketch follows.
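A hedged dask sketch, assuming the slices match the glob pattern below and were written with the key "online" in a format dask can read:

import dask.dataframe as dd

# Lazily point at all slices; nothing is loaded until compute time.
df = dd.read_hdf("/slices02/hdf/*.hdf", key="online")

# Same post-processing as the question, executed out of core; the "*"
# in the output name is expanded to one file per partition.
df = df.ffill()
df.to_hdf("/slices02/hdf/sensor_data_cpl-*.hdf", key="online")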

Upvotes: 2

NotAName

Reputation: 4322

When you have this much data, avoid creating temporary dataframes, as they take up memory too. Try doing it in one pass:

import os
import pandas as pd

folder = "/slices02/hdf/"
files = [os.path.join(folder, file) for file in os.listdir(folder)]
# Read every slice and combine them in a single concat call.
res = pd.concat((pd.read_hdf(file) for file in files), sort=False)

See how this works for you.

Upvotes: 0
