Reputation: 55
Using Python 3.
I have a very large CSV file that I need to split in two and save with to_csv. I'm using the chunksize parameter to control how many rows go into each file. The expectation is that the first call should read the required rows so I can save them into the first CSV file, and the second call should take care of the remaining rows so I can save them in the second CSV file:
As an example, let's say the file is 3000 rows and I use the code below:
file = pd.read_csv(r'file.csv', index_col=None, header='infer', encoding='ISO-8859-1', skiprows=None, chunksize=500)
I've used skiprows=None because I want it to start from the beginning and take the first 500 rows.
Then, the second call should skip the previous 500 rows and take the remaining ones:
file = pd.read_csv(r'file.csv', index_col=None, header='infer', encoding='ISO-8859-1', skiprows=500, chunksize=2500)
However, the result I get from the first call is that it always jumps ahead and returns the last 500 rows rather than starting from the beginning as expected. It doesn't seem like skiprows is working as intended if chunksize always skips to the last given number of rows.
Would appreciate any suggestions on what might be going on here.
Upvotes: 4
Views: 1816
Reputation: 30434
It sounds like you don't really need chunksize at all, if I understand what you are trying to do. Here's code that reads the first 500 lines into df1 and the rest into df2, and then combines them into a single dataframe df3, in case you want that as well.
import pandas as pd

rows = 500
df1 = pd.read_csv('test.csv', nrows=rows)  # first 500 rows
# skiprows=rows+1 also skips the header line, so reuse df1's column names
df2 = pd.read_csv('test.csv', skiprows=rows+1, names=df1.columns)
df3 = pd.concat([df1, df2]).reset_index(drop=True)  # optional: recombine
If you just want to read the original file and output 2 new csv files without creating any intermediate dataframes, perhaps this is what you want?
names = pd.read_csv('test.csv', nrows=2).columns  # store column names
# index=False keeps pandas' row index out of the output files
pd.read_csv('test.csv', nrows=rows).to_csv('foo1.csv', index=False)
pd.read_csv('test.csv', skiprows=rows+1, names=names).to_csv('foo2.csv', index=False)
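If the file is too big to read into memory twice, here is a single-pass sketch using the chunksize the question started with (same hypothetical test.csv, rows, foo1.csv and foo2.csv as above; it assumes foo2.csv does not already exist, since mode='a' appends):

import pandas as pd

rows = 500

# Single pass over the file: chunk 0 goes to foo1.csv, every later chunk is
# appended to foo2.csv (the header is written only with the first appended chunk).
wrote_header = False
for i, df in enumerate(pd.read_csv('test.csv', chunksize=rows)):
    if i == 0:
        df.to_csv('foo1.csv', index=False)
    else:
        df.to_csv('foo2.csv', mode='a', index=False, header=not wrote_header)
        wrote_header = True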
Upvotes: 1
Reputation: 210882
As soon as you use a non-default (not None) value for the chunksize parameter, pd.read_csv returns a TextFileReader iterator instead of a DataFrame. pd.read_csv() will then read your CSV file in chunks of the specified size:
reader = pd.read_csv(filename, chunksize=N)

for df in reader:
    # process df (chunk) here
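Applied to the question's task, for example, each 500-row chunk could be written to its own file (a sketch; the part_*.csv names are hypothetical):

import pandas as pd

# Hypothetical sketch: write every 500-row chunk of file.csv to its own CSV.
for i, df in enumerate(pd.read_csv('file.csv', chunksize=500, encoding='ISO-8859-1')):
    df.to_csv(f'part_{i}.csv', index=False)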
So when using chunksize, all chunks (except the very last one) will have the same length. Using the iterator parameter instead, you can control how much data you read on each iteration via get_chunk(nrows):
In [66]: reader = pd.read_csv(fn, iterator=True)
Let's read the first 3 rows:
In [67]: reader.get_chunk(3)
Out[67]:
a b c
0 2.229657 -1.040086 1.295774
1 0.358098 -1.080557 -0.396338
2 0.731741 -0.690453 0.126648
Now we'll read the next 5 rows:
In [68]: reader.get_chunk(5)
Out[68]:
a b c
0 -0.009388 -1.549381 0.913128
1 -0.256654 -0.073549 -0.171606
2 0.849934 0.305337 2.360101
3 -1.472184 0.641512 -1.301492
4 -2.302152 0.417787 0.485958
And the next 7 rows:
In [69]: reader.get_chunk(7)
Out[69]:
a b c
0 0.492314 0.603309 0.890524
1 -0.730400 0.835873 1.313114
2 1.393865 -1.115267 1.194747
3 3.038719 -0.343875 -1.410834
4 -1.510598 0.664154 -0.996762
5 -0.528211 1.269363 0.506728
6 0.043785 -0.786499 -1.073502
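Applied back to the question, the two-file split then becomes a single pass over the file (a sketch with hypothetical output names; reader.read() called without an argument consumes all remaining rows):

import pandas as pd

reader = pd.read_csv('file.csv', iterator=True, encoding='ISO-8859-1')
reader.get_chunk(500).to_csv('first_500.csv', index=False)  # rows 0-499
reader.read().to_csv('remaining.csv', index=False)          # everything after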
From the docs:
iterator : boolean, default False
Return TextFileReader object for iteration or getting chunks with get_chunk().
chunksize : int, default None
Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.
Upvotes: 3