sudo

Reputation: 5804

Optimizing a long series of SQL update queries with Psycopg2

I need to make a huge number of SQL queries that update or insert rows using Psycopg2. There are no other queries being run in between. Example with a table A having columns name and value:

-- Basically models a list of strings and how many times they "appear"
-- 'foo' is some random value each time, sometimes repeating
insert into A (name, value) select 'foo', 0
    where not exists (select 1 from A where name = 'foo' limit 1);
update A set value = value + 1 where name = 'foo';
-- ... and many more just like this

This is just an example of one type of query I'm running; I'm doing other things too. I'm not looking for a solution that involves reworking my SQL queries.

It's really slow, and Postgres (which is running on another server) is the bottleneck. I've tried various things to make it faster.

Is there some other way I should try involving Psycopg2?

Upvotes: 0

Views: 1467

Answers (1)

Michael Robellard

Reputation: 2358

Sounds like you have a few issues here. The first is that Postgres should not page fault unless it is improperly configured or you are running other services on the machine. A properly configured Postgres instance will use your memory, but it won't page fault.

If you need to insert or update hundreds of thousands of things at a time, you definitely do not want to do that one transaction at a time; as you noted, that will be very slow. In your first example, you are sending each query to the database over the network, waiting for a result, then committing and waiting for that result, once again over the network.

Stringing together multiple statements at a time saves you the one commit per statement and the back-and-forth network traffic, which is why you saw significantly faster performance.
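As a rough illustration (not code from your project), batching with psycopg2 could look something like this, reusing the table A and columns name/value from your example; the connection string and data are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb host=dbserver")  # placeholder DSN
cur = conn.cursor()

names = ['foo', 'bar', 'foo', 'baz']  # example data

# Build one big SQL string out of many upsert pairs, escaping values
# safely with mogrify(), then send it in a single execute() and commit once.
statements = []
for n in names:
    statements.append(cur.mogrify(
        "insert into A (name, value) select %s, 0 "
        "where not exists (select 1 from A where name = %s limit 1);",
        (n, n)).decode())
    statements.append(cur.mogrify(
        "update A set value = value + 1 where name = %s;", (n,)).decode())

# One network round trip for the whole batch, one commit at the end.
cur.execute("\n".join(statements))
conn.commit()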

You can take the stringing together a step further and use COPY if you are doing inserts, or use a VALUES list instead of single INSERT or UPDATE statements.
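A sketch of both options, again assuming the same table A(name, value) and a placeholder connection string; you would use one approach or the other, not both:

import io
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb host=dbserver")  # placeholder DSN
cur = conn.cursor()

rows = [('foo', 3), ('bar', 1), ('baz', 7)]  # example data

# Option 1: one INSERT with a multi-row VALUES list instead of many
# single-row inserts.
execute_values(cur, "insert into A (name, value) values %s", rows)

# Option 2: COPY, usually the fastest way to load many rows at once.
buf = io.StringIO("".join("%s\t%s\n" % (n, v) for n, v in rows))
cur.copy_expert("copy A (name, value) from stdin", buf)

conn.commit()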

The real problem is a design flaw in what you are doing. From the looks of your query, you are implementing a counter in your database. If you are only going to count a few hundred things here or there, that's no big deal, but when you get into the hundreds of thousands or more, it won't work well.

This is where tools like memcached and Redis come in. Both have excellent support for very fast in-memory counters. (If you only have one server, you could just implement a counter in your code.) Once you have things counted, create a process that saves the counts to the database and clears the in-memory counters.
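Here's a rough sketch of that idea with Redis, keeping the question's insert/update pattern for the flush step; the Redis key name, connection details, and helper names are placeholders:

import psycopg2
import redis

r = redis.Redis(host='localhost', port=6379, db=0)  # placeholder Redis

def count(name):
    # Fast in-memory increment; no database round trip per event.
    r.hincrby('counters', name, 1)

def flush_counters():
    # Periodically write the accumulated counts to Postgres and reset them.
    # (A production version would need to handle increments that arrive
    # between hgetall() and delete().)
    conn = psycopg2.connect("dbname=mydb host=dbserver")  # placeholder DSN
    cur = conn.cursor()
    for name, value in r.hgetall('counters').items():
        n = name.decode()
        cur.execute(
            "insert into A (name, value) select %s, 0 "
            "where not exists (select 1 from A where name = %s limit 1)",
            (n, n))
        cur.execute("update A set value = value + %s where name = %s",
                    (int(value), n))
    conn.commit()
    r.delete('counters')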

Upvotes: 1
