Reputation: 975
I have a script which loads data into a database. I grab data from CSV files and return a generator. The values yielded from the generator are used to build insert statements for bulk loading up to 100K records at a time. Currently I have a function that looks like this to build a "list of lists":
def split_statements(data_set, num_of_splits):
    return iter([data_set[pos:pos + num_of_splits] for pos in xrange(0, len(data_set), num_of_splits)])
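For illustration (my example, not from the original post; Python 2 is assumed since the function uses xrange), a small list splits like this:
chunks = split_statements([1, 2, 3, 4, 5, 6, 7], 3)
print(list(chunks))
# prints [[1, 2, 3], [4, 5, 6], [7]]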
That function works fine for anything from 1 line up to several million lines of data, splitting the latter into 100K chunks. However, I have been trying to switch to some sort of generator / lazy-loading function for edge cases of extremely large files. I tried the following:
from itertools import islice

def split_statements(data_set, num_of_splits):
    for i in data_set:
        yield list(islice(data_set, num_of_splits))
This seems clunky and does not work when there is only 1 line in a file. However, it worked great on a 10GB file.
Would appreciate any insight/help.
Thank you!
Upvotes: 0
Views: 668
Reputation: 28606
I doubt it really worked on the 10GB file. I don't know your data_set, but I think the for i in data_set will always read the next element from data_set and make it available as i, which you then ignore. That would explain why the 1 line file didn't work. The 10GB file probably also didn't work and is missing all those lines that were wasted as i.
Demo:
from itertools import islice
it = iter('abcabcabc')
for i in it:
    print(list(islice(it, 2)))
Output:
['b', 'c']
['b', 'c']
['b', 'c']
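If the goal is lazy chunking without dropping rows, one possible fix (my sketch, not part of the original answer) is to pull everything through a single shared iterator with islice and stop once a chunk comes back empty:
from itertools import islice

def split_statements(data_set, num_of_splits):
    it = iter(data_set)          # one shared iterator, so no element is skipped
    while True:
        chunk = list(islice(it, num_of_splits))
        if not chunk:            # iterator exhausted
            return
        yield chunk

# e.g. list(split_statements('abcabcabc', 3))
# == [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]
This also handles the 1-line case, because a single short chunk is still yielded before the iterator runs dry.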
Upvotes: 1