Josh T

Reputation: 564

Which way of locking access to a resource is the most pythonic/correct?

I'm building a multithreaded program because it has a ton of I/O. I want to build a queue that is continually populated with partial information from a database and replenished when it gets low. At the end of each cycle, every thread checks whether the queue needs replenishing. Which of these is the most correct approach (mind you, it's pseudo-code, since I'm only interested in the higher-level concept)?

Also, forgive me, I'm struggling a little bit with namespaces too, as you might see from the code. I've researched this heavily, tested heavily, and I'm still struggling, so any help is much appreciated!

# Option 1 - lock within the thread
import threading

lock = threading.Lock()

def replenish_queue():
    # check whether the queue needs replenishing
    ...

class Worker(threading.Thread):
    def run(self):
        # use an item off the queue
        with lock:
            replenish_queue()

# ----------------------------------------

# Option 2 - lock within function
import threading

lock = threading.Lock()

def replenish_queue():
    with lock:
        # check whether the queue needs replenishing
        ...

class Worker(threading.Thread):
    def run(self):
        # use an item off the queue
        replenish_queue()

Upvotes: 0

Views: 232

Answers (1)

khampson

Reputation: 15306

With a huge number of concurrent database writes, I think one of your main bottlenecks is actually likely to be sqlite itself: write locks in sqlite happen at the database level (http://www.sqlite.org/faq.html).

This is in contrast to other RDBMSs, such as Postgres, which offer locking at the table level for updates in conjunction with MVCC; in terms of updates, that means an UPDATE that write-locks a table won't block other SELECTs from running.

So, whether you accomplish it through threading or multiple processes, trying to concurrently write to the same sqlite database is essentially a serial process, since sqlite requires that a write lock be taken out on the entire DB for the duration of the write.

You can, however, have concurrent reads on an sqlite database. So if your usage pattern allows you to do a bunch of reads, process them (from an internal queue or similar), and then do the writes serially (since sqlite essentially enforces a mutex on the database for writes anyway), that would probably be optimal.
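To make that concrete, here's a minimal sketch of the read-concurrently, write-serially pattern; the items table, its columns, and the done flag are hypothetical stand-ins for your schema:

import sqlite3
import threading

write_lock = threading.Lock()  # funnel all writes through one lock

def read_batch(db_path, limit):
    # each thread opens its own connection; sqlite connections
    # shouldn't be shared across threads by default
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT id, payload FROM items WHERE done = 0 LIMIT ?", (limit,))
        return cur.fetchall()  # reads can run concurrently
    finally:
        conn.close()

def mark_done(db_path, ids):
    # writes go one thread at a time, matching sqlite's own write lock
    with write_lock:
        conn = sqlite3.connect(db_path)
        try:
            conn.executemany(
                "UPDATE items SET done = 1 WHERE id = ?", [(i,) for i in ids])
            conn.commit()
        finally:
            conn.close()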

In terms of the queue itself, a generator, as John Mee mentioned, is definitely one option, and may be best depending on your exact data needs.

However, another option worth considering is using the database itself as a queue -- SQL tables are generally good at that; I do it all the time in Postgres, and I suspect it works similarly well in sqlite. This of course works best if the data you need lives in one table, or can easily be consolidated into one queue. If it's spread across tons of disparate tables that aren't easily JOINed, or across different databases, you'd need an ETL layer in between to get it into queue form, which may not be worth it at that point.
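As a rough sketch of the DB-as-queue idea (the queue table and its status values are made up for illustration), each worker pops work by flipping a status flag, so the table itself tracks what's pending, in progress, and done:

import sqlite3

def claim_next(conn):
    # SELECT-then-UPDATE is fine with a single writer; with several
    # writers you'd wrap this in a transaction to avoid double-claims
    row = conn.execute(
        "SELECT id, payload FROM queue WHERE status = 'pending' LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # queue is empty
    conn.execute(
        "UPDATE queue SET status = 'working' WHERE id = ?", (row[0],))
    conn.commit()
    return row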

If using the DB as a queue does fit well, then with your sqlite library (say, sqlite3) you would declare your cursor and either use it as an iterator to go over rows one by one, or call fetchmany with a size parameter equal to the batch you want to process at once -- let's say 100.
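For instance, a sketch of that fetchmany loop -- the items table and the process function are placeholders for your own schema and work:

import sqlite3

def drain_in_batches(db_path, batch_size=100):
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT id, payload FROM items")
    while True:
        batch = cur.fetchmany(batch_size)  # one round trip per 100 rows
        if not batch:
            break  # cursor exhausted
        process(batch)  # stand-in for your per-batch work
    conn.close()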

You could then parcel each chunk out to another process (i.e. using multiprocessing) to do the work while the main thread fetches the next chunk. It may be worth it first, though, to have a single thread process each chunk -- tinker with the chunk size of the fetch to see if that helps, as often it's the DB interaction where the real time is spent -- and see whether the main thread can whip through the chunks in acceptable time on its own. If not, you can always parcel the work out to a bunch of workers and join them up afterwards.
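If one thread turns out not to be enough, a sketch of the parcel-out version might look like this; transform is a hypothetical stand-in for the real per-row work, and the pool size and chunk size are knobs to tune:

import sqlite3
from multiprocessing import Pool

def transform(row):
    return row  # placeholder for the real per-row work

def process_batch(batch):
    # runs in a worker process while the main process fetches the next chunk
    return [transform(row) for row in batch]

def chunks(cursor, size):
    # yield fetchmany-sized chunks until the cursor runs dry
    while True:
        batch = cursor.fetchmany(size)
        if not batch:
            return
        yield batch

if __name__ == "__main__":
    conn = sqlite3.connect("app.db")
    cur = conn.execute("SELECT id, payload FROM items")
    with Pool(4) as pool:
        for result in pool.imap(process_batch, chunks(cur, 100)):
            pass  # collect results; write them back serially afterwards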

When the cursor runs out of chunks to fetch, the main read work is done; the forked processes (or the main process, if you go that route) do their processing throughout, and once they're all finished you can write the updates back serially from the main thread.

Multiple threads/processes are certainly sometimes needed, especially when doing a ton of I/O, but that's not always the case -- sometimes well-planned, chunked database fetches help a ton and obviate the need for the extra complexity of multithreading/multiprocessing. So I would recommend starting with that and building out from there only if actual observed performance dictates it.

Edit:

I see from one of the comments that you're already queueing using the database. If by this you mean what I was talking about above, then fetching in batches -- and tinkering to find the optimal size -- should help with performance (fewer round trips to the DB means less overhead, even with a local DB). It will definitely help with memory consumption too: if you have, say, 10 million items in the queue, you'll only ever hold one batch of them in memory at a time.

Upvotes: 1
