Reputation: 5713
My CSV file contains 6 million records and I am trying to split it into multiple smaller files using skiprows. My pandas version is '0.12.0' and the code is
pd.read_csv(TRAIN_FILE, chunksize=50000, header=None, skiprows=999999, nrows=100000)
It works as long as skiprows is less than 900000. Any idea if this is expected? If I do not use skiprows, nrows can go up to 5 million records. I have not yet tried beyond that, but will.
I tried a CSV splitter, but it does not handle the first entry properly, maybe because each cell consists of multiple lines of code.
EDIT: I was able to split the 7 GB file into multiple CSV files by reading it with pandas read_csv and writing it out in parts.
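A minimal sketch of that approach, assuming hypothetical file names (TRAIN_FILE and the chunk size are placeholders; the sample-file creation step stands in for the real 7 GB input):

```python
import pandas as pd

# Placeholder names; substitute your real file and a larger chunk size.
TRAIN_FILE = "train.csv"
CHUNK_SIZE = 4

# Create a small sample file so the sketch runs end to end;
# with a real large file you would skip this step.
pd.DataFrame({"a": range(10), "b": range(10)}).to_csv(
    TRAIN_FILE, index=False, header=False
)

# Passing chunksize alone returns a lazy TextFileReader, so the
# whole file is never held in memory at once.
part_files = []
for i, chunk in enumerate(
    pd.read_csv(TRAIN_FILE, header=None, chunksize=CHUNK_SIZE)
):
    out = "train_part_{}.csv".format(i)  # one output file per chunk
    chunk.to_csv(out, index=False, header=False)
    part_files.append(out)

print(part_files)
```

With 10 rows and a chunk size of 4 this produces three part files (4, 4, and 2 rows); concatenating them reproduces the original data.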
Upvotes: 6
Views: 565
Reputation: 911
The problem seems to be that you are specifying both nrows and chunksize. At least in pandas 0.14.0,
pandas.read_csv(filename, nrows=some_number, chunksize=another_number)
returns a DataFrame (reading the whole data), whereas
pandas.read_csv(filename, chunksize=another_number)
returns a TextFileReader that loads the file lazily.
Splitting a csv then works like this:
for i, chunk in enumerate(pandas.read_csv(filename, chunksize=your_chunk_size)):
    chunk.to_csv("chunk_{}.csv".format(i))  # one file per chunk, so earlier chunks are not overwritten
Upvotes: 1