Reputation: 221
I've been trying to process a 1.4 GB CSV file with Pandas, but I keep running into memory problems. So far I have tried the iterator=True and chunksize=number parameters of read_csv, to no avail. Moreover, the smaller the chunksize, the slower it is to process the same amount of data.
The code below may not make much sense on its own; that's because I removed the part that writes into an SQL database, to simplify it and isolate the problem.
import csv, pandas as pd
import glob

filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'):  # reading files Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [5, 8, 15, 16]]  # keep only the four columns of interest
    df = df.dropna(how='any')
    counter += 1
    print(counter)
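For reference, the chunksize variant mentioned above looked roughly like this (a minimal sketch, not the exact code; 10000 is an arbitrary chunk size):

import glob
import pandas as pd

for filename in glob.glob('Crimes_part*.csv'):
    # with chunksize, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once
    for chunk in pd.read_csv(filename, chunksize=10000):
        df = chunk.iloc[:, [5, 8, 15, 16]].dropna(how='any')
        # ... each filtered chunk was then written to the SQL database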
Upvotes: 1
Views: 1697
Reputation: 1
I have run into the same issue with large CSV files. Read the file in chunks of a fixed size: pass the chunksize or iterator parameter so that read_csv returns the data in chunks. Syntax:
csv_onechunk = pandas.read_csv(filepath, sep = delimiter, skiprows = 1, chunksize = 10000)
Then concatenate the chunks (only valid with the C parser).
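A minimal end-to-end sketch of that pattern (the file path is a placeholder, and dropping NA rows per chunk mirrors the question):

import pandas as pd

filepath = 'Crimes_part1.csv'  # placeholder: any one of the part files
# reader is an iterator that yields one DataFrame per 10000 rows
reader = pd.read_csv(filepath, chunksize=10000)
# filter each chunk, then stitch the results back into a single DataFrame
df = pd.concat((chunk.dropna(how='any') for chunk in reader), ignore_index=True)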
Upvotes: 0
Reputation: 221
Thanks for the reply.
After some debugging, I located the problem. The iloc subsetting in pandas created a circular reference, which prevented garbage collection. A detailed discussion can be found here.
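A sketch of a possible workaround along those lines (assuming the reference cycles are what keep the frames alive): drop the references and force a collection after each file:

import gc
import glob
import pandas as pd

for filename in glob.glob('Crimes_part*.csv'):
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [5, 8, 15, 16]].dropna(how='any')
    # ... write df to the SQL database here ...
    del chunk, df  # drop the names so only the cycle still holds the objects
    gc.collect()   # the cyclic collector then reclaims them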
Upvotes: 1
Reputation: 210922
You may try to parse only those columns that you need (as @BrenBarn said in the comments):
import os
import glob
import pandas as pd
def get_merged_csv(flist, **kwargs):
    # read each CSV in the list and concatenate them into one DataFrame
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
fmask = 'Crimes_part*.csv'
cols = [5,8,15,16]
df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols).dropna(how='any')
print(df.head())
P.S. This will include only 4 out of the at least 17 columns in your resulting DataFrame.
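If even the four-column merged frame is too large, usecols can be combined with chunksize so that only the needed columns of each chunk are ever held in memory (a sketch reusing the names above; 100000 is an arbitrary chunk size):

import glob
import pandas as pd

fmask = 'Crimes_part*.csv'
cols = [5, 8, 15, 16]

pieces = []
for f in glob.glob(fmask):
    # usecols limits parsing to the four columns; chunksize bounds peak memory
    for chunk in pd.read_csv(f, usecols=cols, chunksize=100000):
        pieces.append(chunk.dropna(how='any'))

df = pd.concat(pieces, ignore_index=True)
print(df.head())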
Upvotes: 1