Reputation: 5565
I have a csv file with many millions of rows. I want to start iterating from the 10,000,000 row. At the moment I have the code:
with open(csv_file, encoding='UTF-8') as f:
    r = csv.reader(f)
    for row_number, row in enumerate(r):
        if row_number < 10000000:
            continue
        else:
            process_row(row)
This works, however it takes several seconds to run before the rows of interest appear. Presumably all the unrequired rows are loaded into Python unnecessarily, slowing it down. Is there a way of starting the iteration at a certain row, i.e. without reading in the start of the data?
Upvotes: 2
Views: 1442
Reputation: 180391
You could use islice:
from itertools import islice
with open(csv_file, encoding='UTF-8') as f:
    r = csv.reader(f)
    for row in islice(r, 10000000, None):
        process_row(row)
It still iterates over all the rows but does it a lot more efficiently.
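To get a feel for the difference, here is a rough, hypothetical timing sketch (not from the original post): 'big.csv' and the skip count of 1,000,000 are placeholder values, and skip_with_enumerate/skip_with_islice are made-up helper names.

import csv
import timeit
from itertools import islice

def skip_with_enumerate(path, n=1_000_000):
    # Skip rows the way the question does: a Python-level check per row.
    with open(path, encoding='UTF-8') as f:
        r = csv.reader(f)
        for row_number, row in enumerate(r):
            if row_number < n:
                continue
            return row  # first row of interest

def skip_with_islice(path, n=1_000_000):
    # Skip rows with islice, which drops them at C speed.
    with open(path, encoding='UTF-8') as f:
        r = csv.reader(f)
        return next(islice(r, n, None), None)  # first row of interest

print(timeit.timeit(lambda: skip_with_enumerate('big.csv'), number=3))
print(timeit.timeit(lambda: skip_with_islice('big.csv'), number=3))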
You could also use the consume recipe, which uses functions that consume iterators at C speed. Call it on the file object before you pass it to csv.reader, so you also avoid needlessly processing those lines with the reader:
import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
with open(csv_file, encoding='UTF-8') as f:
    consume(f, 10000000)  # skip the first 10,000,000 lines of the file at C speed
    r = csv.reader(f)
    for row in r:
        process_row(row)
As ShadowRanger commented, if the file could contain embedded newlines then you would have to consume the reader instead and pass newline="" when opening the file, but if that is not the case then do consume the file object, as the performance difference will be considerable, especially if you have a lot of columns.
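For completeness, here is a minimal sketch of the embedded-newline case, reusing csv_file and process_row from the question and the consume() recipe defined above:

import csv

# Embedded-newline sketch: open with newline="" so quoted fields containing
# line breaks are parsed correctly, and skip rows on the reader itself
# rather than on the raw file object.
with open(csv_file, encoding='UTF-8', newline='') as f:
    r = csv.reader(f)
    consume(r, 10000000)  # consume() recipe from above, applied to the reader
    for row in r:
        process_row(row)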
Upvotes: 5