Reputation: 4489
Problem:
When using the built-in sqlite3 module in Python (3.6), what are the considerations for iterating over a very large database table and updating it row by row?
Requirements:
I'll need to iterate over each row, use information from one column to perform an action, and then update a second column in that row with the result. I can't do this on the fly; I need the table fully updated.
Additionally, based on the size of the database and the amount of time it will take to perform each action, this is expected to run for multiple days. With that in mind, it needs to be fault tolerant with the ability to periodically commit the changes.
Question:
I am looking at the different fetch methods first, but I'd like to know what the memory considerations are for each. I've seen that with external databases on another server, when you do a fetchmany() the database builds the full result set for your SELECT but only returns arraysize rows at a time -- is this the case with SQLite as well? If so, is there any benefit to using fetchmany() vs fetchall() vs fetchone()?
I plan to use a generator function with executemany() in order to update the database and commit periodically, assuming that I can fetch in chunks so that the generator can break at intervals and trigger a commit.
Example:
table:
columns: id, value1, updated_value
def action_function():
    # compute the new value for each id and yield (value, id) pairs for executemany()
    for next_id in next_row():
        updated_value = compute_value(next_id)
        yield (updated_value, next_id)

def next_row():
    # stream ids from the open SELECT cursor in chunks of arraysize
    while True:
        results = cur.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result[0]

cur.executemany('''UPDATE table SET updated_value = ? WHERE id = ?''', action_function())
Is that around what this should look like? I think the functions should possibly be reversed so that the chunking causes a commit every arraysize rows?
In the end, the database has millions of rows and needs one action computed per row (using a value from that same row), with a column then updated with the result. Loading the whole database into memory will likely not work, so fetchall() seems out, but considering how standard SQL databases work, where a SELECT statement loads the results into memory anyway, is that happening with SQLite3 in Python? If not, what is happening?
Is there a better way I should be doing this, such as simply holding start_rowid and stop_rowid variables for each chunk, calculating the next chunk size, and doing a SELECT statement just for that range of rows?
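Roughly, I'm picturing something like the sketch below (names such as my_table, con, cur, compute_value, and the chunk size are just placeholders, and here I track only the last rowid seen rather than a start/stop pair):

chunk_size = 10000
last_rowid = 0
while True:
    cur.execute(
        "SELECT rowid, id, value1 FROM my_table WHERE rowid > ? ORDER BY rowid LIMIT ?",
        (last_rowid, chunk_size),
    )
    rows = cur.fetchall()
    if not rows:
        break
    # compute and update just this chunk, then commit before moving on
    cur.executemany(
        "UPDATE my_table SET updated_value = ? WHERE id = ?",
        [(compute_value(value1), row_id) for _, row_id, value1 in rows],
    )
    con.commit()
    last_rowid = rows[-1][0]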
Thanks for the help!
Edit:
As an alternative option, there is create_function, which goes row by row. Something like this:
con.create_function("action", 1, compute_value)
cur.execute("UPDATE table SET updated_value = action(id)")
This answer suggests using this method, but considering that the process needs to be at least somewhat fault tolerant, I don't think it will work, and the time to complete will still be large, if only because of the time it takes compute_value to do its thing.
If I used this method and the process was interrupted, would the values computed so far be committed to the database or would they be lost?
Upvotes: 1
Views: 3670
Reputation: 4489
I am looking at the different fetch methods first, but I'd like to know what the memory considerations are for each. I've seen that with external databases on another server, when you do a fetchmany() the database builds the full result set for your SELECT but only returns arraysize rows at a time -- is this the case with SQLite as well? If so, is there any benefit to using fetchmany() vs fetchall() vs fetchone()?
To start out, the SQLite docs describe how a query works on the back end of SQLite. To sum it up, once a SELECT statement is initiated, a sqlite3_stmt is prepared; this holds the instructions for retrieving your results. To get the next result row, sqlite3_step() is called until that row is ready.

So it follows that on the Python side, fetchone() runs step() one time. fetchmany() loops through step() until the number of results equals your arraysize attribute and puts them in a Python list object. And fetchall() loops until there are no results left and again returns a Python list object. So there is a memory benefit, because the resulting list object varies in size with the number of results, while the cursor object itself always stays a fixed size.
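To make that concrete, here is a small illustration (the table and column names are just placeholders):

cur.execute("SELECT id, value1 FROM my_table")

row = cur.fetchone()         # one step(): a single row in memory
chunk = cur.fetchmany(1000)  # up to 1000 steps: a list of up to 1000 rows
rest = cur.fetchall()        # steps until exhausted: the entire remainder as one list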
Is there a better way I should be doing this, such as simply holding start_rowid and stop_rowid variables for each chunk, calculating the next chunk size, and doing a SELECT statement just for that range of rows?
Doing it that way would take more time, because you would be closing out one statement and preparing the next over and over, rather than doing a single SELECT and then using fetchmany() to retrieve chunks of the data at a time.
As far as the rest of my example goes, I decided to go with a multiprocessing approach (suggested by Ente) to make the best use of resources. I'll use a single SELECT with fetchone() to add rows to a Queue, have multiple worker processes pulling from the Queue and making the API calls, and have them add the results to a DoneQueue. A final process will chunk through the DoneQueue and do executemany() with UPDATE.
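For reference, a rough sketch of the shape I have in mind -- not final code; the DB path, table/column names, worker count, queue size, and batch size are placeholders, and compute_value() stands in for the real API call:

import sqlite3
from multiprocessing import Process, Queue

DB_PATH = "data.db"   # placeholder
NUM_WORKERS = 4
BATCH_SIZE = 1000

def compute_value(value1):
    # stands in for the real (slow) API call
    return value1

def reader(todo_q):
    # single SELECT, fetchone() row by row onto the work queue
    con = sqlite3.connect(DB_PATH)
    cur = con.cursor()
    cur.execute("SELECT id, value1 FROM my_table")
    while True:
        row = cur.fetchone()
        if row is None:
            break
        todo_q.put(row)
    for _ in range(NUM_WORKERS):
        todo_q.put(None)   # one stop sentinel per worker
    con.close()

def worker(todo_q, done_q):
    # pull rows, make the slow call, push (result, id) pairs
    while True:
        row = todo_q.get()
        if row is None:
            done_q.put(None)
            break
        row_id, value1 = row
        done_q.put((compute_value(value1), row_id))

def writer(done_q):
    # chunk results from the done queue into executemany() + commit
    con = sqlite3.connect(DB_PATH)
    cur = con.cursor()
    sql = "UPDATE my_table SET updated_value = ? WHERE id = ?"
    batch, finished = [], 0
    while finished < NUM_WORKERS:
        item = done_q.get()
        if item is None:
            finished += 1
            continue
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            cur.executemany(sql, batch)
            con.commit()
            batch.clear()
    if batch:
        cur.executemany(sql, batch)
        con.commit()
    con.close()

if __name__ == "__main__":
    todo_q = Queue(maxsize=10000)   # bound memory used by pending rows
    done_q = Queue()
    workers = [Process(target=worker, args=(todo_q, done_q)) for _ in range(NUM_WORKERS)]
    writer_proc = Process(target=writer, args=(done_q,))
    for p in workers:
        p.start()
    writer_proc.start()
    reader(todo_q)   # run the reader in the main process
    for p in workers:
        p.join()
    writer_proc.join()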
Upvotes: 1
Reputation: 367
It's better to do the task in batches: that will dramatically reduce the time spent on I/O into and out of the DB.
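Something along these lines, as a sketch only (the table and column names follow the question, the batch size is arbitrary, and compute_value() is assumed to exist):

import sqlite3

con = sqlite3.connect("data.db")   # placeholder path
read_cur = con.cursor()
write_cur = con.cursor()           # second cursor so the ongoing SELECT is not disturbed
batch_size = 10000

read_cur.execute("SELECT id, value1 FROM my_table")
while True:
    rows = read_cur.fetchmany(batch_size)
    if not rows:
        break
    params = [(compute_value(value1), row_id) for row_id, value1 in rows]
    write_cur.executemany("UPDATE my_table SET updated_value = ? WHERE id = ?", params)
    con.commit()                   # one commit per batch instead of per row

This keeps at most one batch of rows and results in memory while still limiting how much work can be lost to an interruption.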
Upvotes: 1
Reputation: 2462
Assuming that compute_value() can't be done in the database itself, I would keep it simple: SELECT everything, then fetchone(), run compute_value(), and UPDATE that row. Your question sounds a lot like compute_value() is where you lose most of the time. Reading and committing one row at a time probably does not add much overhead and minimizes the memory impact. It also comes with the added benefit that you always commit after each compute_value(). If the per-row overhead does turn out to matter, I would switch to a windowed approach with fetchmany().
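A minimal sketch of that loop, assuming a schema like in the question (the connection, table, and column names are placeholders) and assuming unprocessed rows still have updated_value set to NULL so a restart can pick up where it left off:

import sqlite3

con = sqlite3.connect("data.db")   # placeholder path
read_cur = con.cursor()
write_cur = con.cursor()           # separate cursor for the UPDATEs

read_cur.execute("SELECT id, value1 FROM my_table WHERE updated_value IS NULL")
while True:
    row = read_cur.fetchone()
    if row is None:
        break
    row_id, value1 = row
    result = compute_value(value1)  # this is where most of the time goes
    write_cur.execute("UPDATE my_table SET updated_value = ? WHERE id = ?", (result, row_id))
    con.commit()                    # commit per row: slow, but nothing is lost on a crash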
Investigating how to parallelize execution of compute_value() on a per-row basis may also make sense in your case. Using multiprocessing with a job queue, where your rows are the jobs, could be beneficial.
Upvotes: 1