Reputation: 220
I have a very large CSV file that I read by iterating over it with pandas' chunksize parameter. The problem: with e.g. chunksize=2, it skips the first 2 rows and the first chunk I receive is rows 3-4.
Basically, if I read the CSV with nrows=4, I get the first 4 rows, while chunking the same file with chunksize=2 first gives me rows 3 and 4, then 5 and 6, ...
#1. Read with nrows
#read first 4 rows of the csv file and merge the date and time columns to use as the index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], nrows=4)
print (reader)
01/01/2016 - 09:30 - A - 100
01/01/2016 - 13:30 - A - 110
01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115
#2. Iterate over csv file with chunks
#iterate over the csv file in chunks and merge the date and time columns to use as the index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], chunksize=2)
for chunk in reader:
    #create a dataframe from chunks
    df = reader.get_chunk()
    print (df)
01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115
Increasing chunksize to 10 skips the first 10 rows.
Any ideas how I can fix this? I already have a workaround that works, but I'd like to understand where I went wrong.
Any input is appreciated!
Upvotes: 1
Views: 1720
Reputation: 33803
Don't call get_chunk. You already have your chunk since you're iterating over the reader, i.e. chunk is your DataFrame. Each get_chunk() call inside the loop consumes the next chunk from the reader, which is why every other chunk (starting with the first two rows) appears to be skipped. Call print(chunk) in your loop, and you should see the expected output.
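Here is a minimal sketch of the corrected loop, reusing the read_csv arguments from the question (the filename and column positions are the asker's):

import pandas as pd

#iterate over the reader; each iteration already yields the next chunk
#as a DataFrame, so no extra get_chunk() call is needed
reader = pd.read_csv('filename.csv', delimiter=',',
                     parse_dates={"Datetime": [1, 2]}, index_col=[0],
                     chunksize=2)
for chunk in reader:
    print(chunk)  #rows 1-2 first, then 3-4, and so on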
As @MaxU points out in the comments, you want to use get_chunk if you want differently sized chunks: reader.get_chunk(500), reader.get_chunk(100), etc.
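For example, a sketch of pulling differently sized chunks on demand (assuming the same file and columns as above; passing iterator=True returns a TextFileReader without a fixed chunk size):

reader = pd.read_csv('filename.csv', delimiter=',',
                     parse_dates={"Datetime": [1, 2]}, index_col=[0],
                     iterator=True)
first = reader.get_chunk(500)   #next 500 rows as a DataFrame
second = reader.get_chunk(100)  #the following 100 rows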
Upvotes: 4