Reputation: 101
I have a 1 GB, 70M row file, and any time I try to load all of it, it runs out of memory. I have read in the first 1000 rows and been able to prototype what I'd like it to do.
My problem is that I don't know how to get the next 1000 rows, apply my logic, and then keep running through the file until the last rows are done. I've read about chunksize, but I can't figure out how to keep the iteration over the chunks going.
Ideally, it would flow like this:
1) Read in the first 1000 rows
2) Filter the data based on criteria
3) Write to csv
4) Repeat until there are no more rows
Here's what I have so far:
import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000, iterator=True)
data = data[data['visits'] > 10]

with open('data.csv', 'a') as f:
    data.to_csv(f, sep=',', index=False, header=False)
Upvotes: 5
Views: 17903
Reputation: 29710
When you pass a chunksize or iterator=True, pd.read_table returns a TextFileReader that you can iterate over or call get_chunk on. So you need to iterate over, or call get_chunk on, data.
So proper handling of your entire file might look something like
import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000, iterator=True)
with open('data.csv', 'a') as f:
    for chunk in data:
        chunk[chunk.visits > 10].to_csv(f, sep=',', index=False, header=False)
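If you'd rather pull chunks explicitly, here is a minimal sketch using get_chunk (assuming the same datafile.txt and visits column as above); get_chunk raises StopIteration once the file is exhausted:

import pandas as pd

reader = pd.read_table('datafile.txt', sep='\t', chunksize=1000, iterator=True)
with open('data.csv', 'a') as f:
    while True:
        try:
            chunk = reader.get_chunk()  # next 1000 rows
        except StopIteration:
            break  # no more rows left in the file
        chunk[chunk.visits > 10].to_csv(f, sep=',', index=False, header=False)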
Upvotes: 3
Reputation: 24945
You have a problem with your logic: we want to loop over each chunk of the data, not over the data object itself.
The chunksize argument gives us a TextFileReader object that we can iterate over.
import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000)
for chunk in data:
    chunk = chunk[chunk['visits'] > 10]
    # append so each filtered chunk adds to the file instead of overwriting it
    chunk.to_csv('data.csv', mode='a', index=False, header=False)
You will need to think about how to handle your header!
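One way to handle it (a sketch, assuming you want the header written exactly once to data.csv) is to write the header only with the first chunk and append the rest:

import pandas as pd

data = pd.read_table('datafile.txt', sep='\t', chunksize=1000)
for i, chunk in enumerate(data):
    chunk = chunk[chunk['visits'] > 10]
    # write ('w') with the header for the first chunk, then append ('a') without it
    chunk.to_csv('data.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)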
Upvotes: 10