user982599

Reputation: 975

Python: lazy loading large data sets

I have a script which loads data into a database. I grab data from CSV files and return a generator. The values yielded from the generator are used to build insert statements for bulk loading, up to 100K records at a time. Currently I have a function that looks like this to build a "list of lists":

def split_statements(data_set, num_of_splits):
    # Slice the full in-memory list into chunks of num_of_splits records each.
    return iter([data_set[pos:pos + num_of_splits]
                 for pos in xrange(0, len(data_set), num_of_splits)])
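For context, here is a simplified, self-contained illustration of how the chunker gets driven; the toy list and the print call stand in for the real CSV rows and the bulk insert, which I have left out:

rows = list(range(10))                  # stand-in for the rows read from a CSV
for chunk in split_statements(rows, 4):
    print(chunk)                        # stand-in for building/executing a bulk insert

Output:

[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]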

This approach works fine for anything from 1 line up to several million lines of data, splitting the latter into 100K-record chunks. However, I have been trying to switch to some sort of generator / lazy loading function for edge cases of extremely large files. I tried the following:

from itertools import islice

def split_statements(data_set, num_of_splits):
    for i in data_set:
        yield list(islice(data_set, num_of_splits))

This seems clunky and does not work when there is only 1 line in a file. However, it worked great on a 10GB file.

Would appreciate any insight/help.

Thank you!

Upvotes: 0

Views: 668

Answers (1)

Stefan Pochmann

Reputation: 28606

I doubt it really worked on the 10GB file. I don't know what your data_set is, but I think the for i in data_set loop will always read the next element from data_set and make it available as i, which you then ignore. That would explain why the 1-line file didn't work. The 10GB file probably also didn't work, and is missing all the lines that were wasted as i.

Demo:

from itertools import islice
it = iter('abcabcabc')
for i in it:                        # each pass silently consumes one element as i
    print(list(islice(it, 2)))      # then islice consumes the next two

Output:

['b', 'c']
['b', 'c']
['b', 'c']
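For completeness, one way to chunk lazily without losing elements is a sketch along these lines (reusing the question's split_statements name): pull every chunk directly from the iterator with islice and stop when it comes back empty, so no element is thrown away as i.

from itertools import islice

def split_statements(data_set, num_of_splits):
    it = iter(data_set)                        # works for lists and generators alike
    while True:
        chunk = list(islice(it, num_of_splits))
        if not chunk:                          # iterator exhausted
            return
        yield chunk

print(list(split_statements('abcabcabc', 2)))

Output:

[['a', 'b'], ['c', 'a'], ['b', 'c'], ['a', 'b'], ['c']]

This also handles the single-line case: a one-element input simply yields one chunk containing that element.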

Upvotes: 1
