Meng zhao

Reputation: 221

Pandas MemoryError when reading large CSV followed by `.iloc` slicing columns

I've been trying to process a 1.4 GB CSV file with Pandas, but I keep running into memory problems. I have tried different things in an attempt to make pandas.read_csv work, to no avail.

The code may not make much sense on its own; that's because I removed the part that writes into an SQL database, to simplify it and isolate the problem.

import csv
import glob
import pandas as pd

filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'):  # Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)       # read the whole file into memory
    df = chunk.iloc[:, [5, 8, 15, 16]]  # keep only the four columns of interest
    df = df.dropna(how='any')
    # (the SQL write that used df went here)
    counter += 1
    print(counter)

Upvotes: 1

Views: 1697

Answers (3)

Nisha Rajnor

Reputation: 1

I have run into the same issue with large CSV files. The fix is to read the CSV in chunks of a fixed size: use the chunksize (or iterator) parameter of read_csv so the data is returned in chunks. Syntax:

csv_onechunk = pandas.read_csv(filepath, sep=delimiter, skiprows=1, chunksize=10000)

Then concatenate the chunks. (Only valid with the C parser.)
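A minimal sketch of that pattern applied to the files in the question (the chunksize of 10000 is taken from the syntax line above and the column indices from the question; both are assumptions about your data):

import glob
import pandas as pd

frames = []
for filename in glob.glob('Crimes_part*.csv'):
    # read each file in 10,000-row chunks instead of all at once
    for chunk in pd.read_csv(filename, usecols=[5, 8, 15, 16], chunksize=10000):
        frames.append(chunk.dropna(how='any'))

# concatenate the per-chunk frames into one DataFrame
df = pd.concat(frames, ignore_index=True)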

Upvotes: 0

Meng zhao

Reputation: 221

Thanks for the reply.

After some debugging, I have located the problem. The iloc subsetting in pandas created a circular reference, which prevented garbage collection. A detailed discussion can be found here
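A minimal sketch of the kind of workaround this points to, assuming the goal is simply to make sure each file's data is released before the next one is read (the .copy() and explicit gc.collect() calls are illustrative, not necessarily the exact fix):

import gc
import glob
import pandas as pd

for filename in glob.glob('Crimes_part*.csv'):
    chunk = pd.read_csv(filename)
    # copy() gives an independent frame, so the 4-column slice does not
    # keep a reference back to the full DataFrame
    df = chunk.iloc[:, [5, 8, 15, 16]].copy().dropna(how='any')
    # ... write df to the SQL database here ...
    del chunk, df
    gc.collect()  # collect anything stuck in a reference cycle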

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210922

You may try to parse only those columns that you need (as @BrenBarn said in the comments):

import os
import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    # read each file (passing read_csv kwargs through) and stack the results
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

fmask = 'Crimes_part*.csv'
cols = [5,8,15,16]

df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols).dropna(how='any')

print(df.head())

PS: this will include only 4 of the (at least 17) columns in your resulting data frame

Upvotes: 1
