Reputation: 37
I wrote a small, simple script to read and process a huge CSV file (~150 GB). It reads 5e6 rows per loop, converts them to a Pandas DataFrame, does something with it, and then reads the next 5e6 rows.
Although it does the job, each iteration takes longer than the last to find the next chunk of rows to read, since it has to skip an ever larger number of rows. I have read many answers recommending the chunk approach (a reader iterator), but once the chunks have been read I would then need to concatenate them to create a DataFrame (with all sorts of issues around truncated rows and such), so I would prefer not to go down that road.
Is it possible to use some kind of cursor so that read_csv remembers where it stopped and starts reading from there?
The main part of the code looks like this:
import pandas as pd

sr = 0  # rows already processed
while condition:
    # nrows must be an integer; 5e6 is a float
    df = pd.read_csv(inputfile, sep=',', header=None, skiprows=sr, nrows=5_000_000)
    # do something with df
    sr += 5_000_000
    # if something goes wrong, condition turns False
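For reference, pandas does expose exactly this kind of cursor: passing iterator=True to read_csv returns a TextFileReader whose get_chunk(n) continues from wherever the previous read stopped, with no re-skipping. A minimal sketch, reusing inputfile from above (the context-manager form needs pandas >= 1.2):

import pandas as pd

# iterator=True returns a TextFileReader that keeps its position in the file,
# so each get_chunk(n) call picks up where the previous one stopped.
with pd.read_csv(inputfile, sep=',', header=None, iterator=True) as reader:
    while True:
        try:
            df = reader.get_chunk(5_000_000)  # next 5e6 rows
        except StopIteration:
            break  # end of file reached
        # do something with df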
Upvotes: 1
Views: 2692
Reputation: 210972
With your approach, Pandas has to start reading this huge CSV file from the very beginning on every iteration just to skip the rows you have already processed, so each pass gets slower and the total work grows quadratically with the number of chunks.
I think you do want to use the chunksize parameter:
import pandas as pd

reader = pd.read_csv(inputfile, sep=',', header=None, chunksize=5 * 10**6)
for df in reader:
    # do something with df
    something_went_wrong = False  # set to True in your processing when needed
    if something_went_wrong:
        break
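Note that each df the iterator yields is a complete DataFrame built from whole rows, so there is nothing to concatenate; the usual pattern is to fold each chunk into a running result instead. A hypothetical sketch (the column index 0 and the aggregation are placeholders, not from the question):

import pandas as pd

total_rows = 0
col0_sum = 0.0
for chunk in pd.read_csv(inputfile, sep=',', header=None, chunksize=5 * 10**6):
    total_rows += len(chunk)    # accumulate instead of concatenating
    col0_sum += chunk[0].sum()  # assumes column 0 is numeric
print(total_rows, col0_sum)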
Upvotes: 4