user982599

Reputation: 975

Python: lazy loading large data sets

I have a script which loads data into a database. I grab data from CSV files and return a generator. The values yielded from the generator are used to build insert statements for bulk loading, up to 100K records at a time. Currently I have a function that looks like this to build a "list of lists":

def split_statements(data_set, num_of_splits):
    # Slice the full in-memory list into chunks of num_of_splits records each.
    return iter([data_set[pos:pos + num_of_splits]
                 for pos in xrange(0, len(data_set), num_of_splits)])
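For context, here is a simplified, self-contained illustration of how the chunker gets driven; the toy list and the print call stand in for the real CSV rows and the bulk insert, which I have left out:

rows = list(range(10))                  # stand-in for the rows read from a CSV
for chunk in split_statements(rows, 4):
    print(chunk)                        # stand-in for building/executing a bulk insert

Output:

[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]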

This approach works fine for anything from 1 line up to several million lines of data, splitting the latter into 100K-record chunks. However, I have been trying to switch to some sort of generator / lazy loading function for edge cases of extremely large files. I tried the following:

from itertools import islice

def split_statements(data_set, num_of_splits):
    for i in data_set:
        yield list(islice(data_set, num_of_splits))

This seems clunky and does not work when there is only 1 line in a file. However, it worked great on a 10GB file.

Would appreciate any insight/help.

Thank you!

Upvotes: 0

Views: 668

Answers (1)

Stefan Pochmann

Reputation: 28606

I doubt it really worked on the 10GB file. I don't know what your data_set is, but I think the for i in data_set loop will always read the next element from data_set and make it available as i, which you then ignore. That would explain why the 1-line file didn't work. The 10GB file probably also didn't work, and is missing all the lines that were wasted as i.

Demo:

from itertools import islice
it = iter('abcabcabc')
for i in it:                        # each pass silently consumes one element as i
    print(list(islice(it, 2)))      # then islice consumes the next two

Output:

['b', 'c']
['b', 'c']
['b', 'c']
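For completeness, one way to chunk lazily without losing elements is a sketch along these lines (reusing the question's split_statements name): pull every chunk directly from the iterator with islice and stop when it comes back empty, so no element is thrown away as i.

from itertools import islice

def split_statements(data_set, num_of_splits):
    it = iter(data_set)                        # works for lists and generators alike
    while True:
        chunk = list(islice(it, num_of_splits))
        if not chunk:                          # iterator exhausted
            return
        yield chunk

print(list(split_statements('abcabcabc', 2)))

Output:

[['a', 'b'], ['c', 'a'], ['b', 'c'], ['a', 'b'], ['c']]

This also handles the single-line case: a one-element input simply yields one chunk containing that element.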

Upvotes: 1
