user644745

Reputation: 5713

Pandas skiprows beyond 900000 fails

My CSV file contains 6 million records, and I am trying to split it into multiple smaller files using skiprows. My pandas version is 0.12.0, and the code is

pd.read_csv(TRAIN_FILE, chunksize=50000, header=None, skiprows=999999, nrows=100000)

It works as long as skiprows is less than 900000. Is this expected? If I do not use skiprows, nrows can go up to 5 million records. I have not yet tried beyond that, but will.

I tried a CSV splitter, but it does not handle the first entry properly, possibly because each cell contains multiple lines of code.

EDIT: I was able to split it by reading the entire 7 GB file with pandas read_csv and writing it out in parts to multiple CSV files.
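For reference, the approach above can be sketched roughly as follows (a minimal example; the filenames `train.csv` and `train_part_*.csv` and the tiny generated data are placeholders standing in for the real 7 GB file):

```python
import pandas as pd

# Stand-in for the large training file: 10 rows instead of 6 million.
pd.DataFrame({"a": range(10), "b": range(10)}).to_csv("train.csv", index=False)

# Read the file lazily in fixed-size chunks and write each chunk
# to its own numbered CSV file.
for i, chunk in enumerate(pd.read_csv("train.csv", chunksize=4)):
    chunk.to_csv("train_part_{}.csv".format(i), index=False)
```

With chunksize set, read_csv never loads the whole file into memory at once, which avoids the skiprows issue entirely.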

Upvotes: 6

Views: 565

Answers (1)

Matthias Ossadnik

Reputation: 911

The problem seems to be that you are specifying both nrows and chunksize. At least in pandas 0.14.0 using

pandas.read_csv(filename, nrows=some_number, chunksize=another_number)

returns a DataFrame (reading the whole file), whereas

pandas.read_csv(filename, chunksize=another_number)

returns a TextFileReader that loads the file lazily.
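A quick way to see the lazy behavior (a minimal sketch using an in-memory CSV; note this only demonstrates the chunksize-only case, since the interaction of nrows and chunksize has changed across pandas versions):

```python
from io import StringIO
import pandas as pd

csv_data = "a,b\n1,2\n3,4\n5,6\n"

# With chunksize alone, read_csv returns a lazy reader object,
# not a DataFrame; iterating it yields DataFrames of up to 2 rows each.
reader = pd.read_csv(StringIO(csv_data), chunksize=2)
chunks = list(reader)
```

Here `chunks` ends up as two DataFrames (two rows, then one row), confirming the data is delivered piecewise.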

Splitting a csv then works like this:

for i, chunk in enumerate(pandas.read_csv(filename, chunksize=your_chunk_size)):
    chunk.to_csv("part_{}.csv".format(i))

Upvotes: 1
