P.Cummings

Reputation: 101

Pandas Chunksize iterator

I have a 1GB, 70M-row file that runs out of memory whenever I try to load it all at once. I have read in 1000 rows and been able to prototype the logic I'd like to apply.

My problem is that I don't know how to get the next 1000 rows, apply my logic, and keep running through the file until the last rows are done. I've read about chunksize, but I can't figure out how to keep the iteration going.

Ideally, it would flow like such:

1) read in the first 1000 rows
2) filter the data based on criteria
3) write to csv
4) repeat until there are no more rows

Here's what I have so far:

import pandas as pd
data=pd.read_table('datafile.txt',sep='\t',chunksize=1000, iterator=True)
data=data[data['visits']>10]
with open('data.csv', 'a') as f:
    data.to_csv(f,sep = ',', index=False, header=False)

Upvotes: 5

Views: 17903

Answers (2)

miradulo

Reputation: 29710

When you pass a chunksize or iterator=True, pd.read_table returns a TextFileReader rather than a DataFrame, so you can't filter data directly; you need to iterate over it (or call get_chunk on it) and filter each chunk as it arrives.

So proper handling of your entire file might look something like this:

import pandas as pd

# with chunksize, read_table returns a TextFileReader that yields DataFrames
data = pd.read_table('datafile.txt', sep='\t', chunksize=1000, iterator=True)

with open('data.csv', 'a') as f:
    for chunk in data:
        # filter each chunk and append it to the output file
        chunk[chunk.visits > 10].to_csv(f, sep=',', index=False, header=False)
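
If you prefer to pull chunks explicitly rather than looping, a minimal sketch of the get_chunk variant (same file and filter as above; get_chunk should raise StopIteration once the file is exhausted) might look like:

import pandas as pd

reader = pd.read_table('datafile.txt', sep='\t', chunksize=1000)

with open('data.csv', 'a') as f:
    while True:
        try:
            chunk = reader.get_chunk()  # next 1000 rows, or StopIteration at EOF
        except StopIteration:
            break
        chunk[chunk.visits > 10].to_csv(f, sep=',', index=False, header=False)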

Upvotes: 3

jeremycg

Reputation: 24945

You have a problem with your logic: we want to loop over each chunk of the data, not over the data object itself.

The chunksize argument gives us a TextFileReader object that we can iterate over.

import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000)

for chunk in data:
    chunk = chunk[chunk['visits'] > 10]
    # append each filtered chunk so later chunks don't overwrite earlier ones
    chunk.to_csv('data.csv', mode='a', index=False, header=False)

You will need to think about how to handle your header!
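
One rough way to handle it (a sketch, assuming you only want the header from the first chunk) is to write the header on the first pass and append without it afterwards:

import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000)

first = True
for chunk in data:
    chunk = chunk[chunk['visits'] > 10]
    # write the header only for the first chunk, then append without it
    chunk.to_csv('data.csv', mode='w' if first else 'a',
                 index=False, header=first)
    first = False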

Upvotes: 10
