Fra_S

Reputation: 55

chunksize isn't starting from the first row of a CSV file

Using Python 3.

I have a very large CSV file that I need to split in two and save with to_csv. I'm using the chunksize parameter to determine how many rows end up in each file. The expectation is that the first call should read the required rows so I can save them to the first CSV file, and the second should take care of the remaining rows so I can save them to the second CSV file:

As an example, let's say the file is 3000 rows and I'm using the code below:

file = pd.read_csv(r'file.csv', index_col=None, header='infer', encoding='ISO-8859-1', skiprows=None, chunksize=500)

I've used skiprows=None as I want it to start from the beginning and chunk the first 500 rows.

Then the second call should skip the previous 500 and chunk the rest:

file = pd.read_csv(r'file.csv', index_col=None, header='infer', encoding='ISO-8859-1', skiprows=500, chunksize=2500)

However, the result I get from the first call is that it always goes straight to the last 500 rows and chunks those, rather than starting from the beginning as expected. It doesn't seem like skiprows is working correctly if chunksize always skips ahead to the last given number.

I would appreciate any suggestions on what might be going on here.

Upvotes: 4

Views: 1816

Answers (2)

JohnE

Reputation: 30434

It sounds like you don't really need chunksize at all, if I understand what you are trying to do. Here's code that reads the first 500 lines into df1 and the rest into df2, then combines them into a single dataframe in case you want to do that as well.

import pandas as pd

rows = 500

# Read the first 500 data rows, then everything after them;
# skiprows=rows+1 also skips the header line, so reuse df1's column names.
df1 = pd.read_csv('test.csv', nrows=rows)
df2 = pd.read_csv('test.csv', skiprows=rows + 1, names=df1.columns)

# Optionally recombine into a single dataframe.
df3 = pd.concat([df1, df2]).reset_index(drop=True)
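
If you want to verify that nothing was lost in the split, here's a quick sanity check (just a sketch; it assumes the whole file fits in memory):

import pandas as pd

# The recombined frame should match a plain full read; check_dtype=False
# because dtypes can be inferred differently per part (e.g. a column that
# only has missing values after row 500).
pd.testing.assert_frame_equal(df3, pd.read_csv('test.csv'), check_dtype=False)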

If you just want to read the original file and write out two new CSV files without keeping any intermediate dataframes around, perhaps this is what you want:

# Store the column names (only the header and a couple of rows are read here).
names = pd.read_csv('test.csv', nrows=2).columns

# Write the first 500 rows and the remainder to separate files;
# index=False keeps the outputs in the same shape as the input.
pd.read_csv('test.csv', nrows=rows).to_csv('foo1.csv', index=False)
pd.read_csv('test.csv', skiprows=rows + 1, names=names).to_csv('foo2.csv', index=False)
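
And if the file is too large to read whole, the same split can be streamed with chunksize (a minimal sketch reusing the example names above; note that mode='a' appends, so remove any stale foo2.csv before re-running):

import pandas as pd

rows = 500

# Streaming variant: never holds more than `rows` rows in memory.
reader = pd.read_csv('test.csv', chunksize=rows)
reader.get_chunk(rows).to_csv('foo1.csv', index=False)           # first 500 rows
for i, chunk in enumerate(reader):                               # rows 500 onward
    chunk.to_csv('foo2.csv', mode='a', header=(i == 0), index=False)
reader.close()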

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210882

As soon as you pass a non-default (non-None) value for the chunksize parameter, pd.read_csv() returns a TextFileReader iterator instead of a DataFrame and will read your CSV file in chunks of the specified size:

reader = pd.read_csv(filename, chunksize=N)
for df in reader:
    # process df (chunk) here
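
For example, a small sketch (using the question's file name and encoding) shows that the chunks arrive in file order, starting at row 0:

import pandas as pd

# Chunks are yielded in order from the top of the file;
# nothing is skipped unless you ask for it via skiprows.
reader = pd.read_csv('file.csv', encoding='ISO-8859-1', chunksize=500)
for i, df in enumerate(reader):
    print(f'chunk {i}: rows {i * 500} to {i * 500 + len(df) - 1}')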

So when using chunksize, all chunks (except possibly the very last one) will have the same length. Using the iterator parameter instead, you can control how much data to read on each call with get_chunk(nrows):

In [66]: reader = pd.read_csv(fn, iterator=True)

Let's read the first 3 rows:

In [67]: reader.get_chunk(3)
Out[67]:
          a         b         c
0  2.229657 -1.040086  1.295774
1  0.358098 -1.080557 -0.396338
2  0.731741 -0.690453  0.126648

Now we'll read the next 5 rows:

In [68]: reader.get_chunk(5)
Out[68]:
          a         b         c
0 -0.009388 -1.549381  0.913128
1 -0.256654 -0.073549 -0.171606
2  0.849934  0.305337  2.360101
3 -1.472184  0.641512 -1.301492
4 -2.302152  0.417787  0.485958

And the next 7 rows:

In [69]: reader.get_chunk(7)
Out[69]:
          a         b         c
0  0.492314  0.603309  0.890524
1 -0.730400  0.835873  1.313114
2  1.393865 -1.115267  1.194747
3  3.038719 -0.343875 -1.410834
4 -1.510598  0.664154 -0.996762
5 -0.528211  1.269363  0.506728
6  0.043785 -0.786499 -1.073502
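
Applied to the question's 3000-row file, the whole split becomes two get_chunk() calls (a sketch reusing the question's file name and encoding; the output file names are made up):

import pandas as pd

# The question's 500 / 2500 split via get_chunk().
reader = pd.read_csv('file.csv', encoding='ISO-8859-1', iterator=True)
reader.get_chunk(500).to_csv('first_500.csv', index=False)     # rows 0-499
reader.get_chunk(2500).to_csv('rest_2500.csv', index=False)    # rows 500-2999
reader.close()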

From the docs:

iterator : boolean, default False

Return TextFileReader object for iteration or getting chunks with get_chunk().

chunksize : int, default None

Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.

Upvotes: 3
