P.Cummings

Reputation: 101

Pandas Chunksize iterator

I have a 1GB, 70M-row file that runs out of memory whenever I try to load it all at once. I have read in 1000 rows and been able to prototype the logic I'd like to apply.

My problem is that I don't know how to get the next 1000 rows, apply my logic, and keep running through the file until the last rows are done. I've read about chunksize, but I can't figure out how to keep the iteration going.

Ideally, it would flow like such:

1) read in the first 1000 rows
2) filter the data based on criteria
3) write to csv
4) repeat until there are no more rows

Here's what I have so far:

import pandas as pd
data=pd.read_table('datafile.txt',sep='\t',chunksize=1000, iterator=True)
data=data[data['visits']>10]
with open('data.csv', 'a') as f:
    data.to_csv(f,sep = ',', index=False, header=False)

Upvotes: 5

Views: 17903

Answers (2)

miradulo

Reputation: 29710

When you pass a chunksize or iterator=True, pd.read_table returns a TextFileReader rather than a DataFrame, so you can't filter data directly; you need to iterate over it (or call get_chunk on it) and filter each chunk as it arrives.

So proper handling of your entire file might look something like this:

import pandas as pd

# with chunksize, read_table returns a TextFileReader that yields DataFrames
data = pd.read_table('datafile.txt', sep='\t', chunksize=1000, iterator=True)

with open('data.csv', 'a') as f:
    for chunk in data:
        # filter each chunk and append it to the output file
        chunk[chunk.visits > 10].to_csv(f, sep=',', index=False, header=False)
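
If you prefer to pull chunks explicitly rather than looping, a minimal sketch of the get_chunk variant (same file and filter as above; get_chunk should raise StopIteration once the file is exhausted) might look like:

import pandas as pd

reader = pd.read_table('datafile.txt', sep='\t', chunksize=1000)

with open('data.csv', 'a') as f:
    while True:
        try:
            chunk = reader.get_chunk()  # next 1000 rows, or StopIteration at EOF
        except StopIteration:
            break
        chunk[chunk.visits > 10].to_csv(f, sep=',', index=False, header=False)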

Upvotes: 3

jeremycg

Reputation: 24945

You have a problem with your logic: we want to loop over each chunk of the data, not over the data object itself.

The chunksize argument gives us a TextFileReader object that we can iterate over.

import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000)

for chunk in data:
    chunk = chunk[chunk['visits'] > 10]
    # append each filtered chunk so later chunks don't overwrite earlier ones
    chunk.to_csv('data.csv', mode='a', index=False, header=False)

You will need to think about how to handle your header!
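
One rough way to handle it (a sketch, assuming you only want the header from the first chunk) is to write the header on the first pass and append without it afterwards:

import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000)

first = True
for chunk in data:
    chunk = chunk[chunk['visits'] > 10]
    # write the header only for the first chunk, then append without it
    chunk.to_csv('data.csv', mode='w' if first else 'a',
                 index=False, header=first)
    first = False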

Upvotes: 10
