Reputation: 64739
Is there an especially efficient way to bulk-update a simple boolean field in Postgres for a large number of records?
I have a table containing millions of rows, and occasionally I want to set fresh=false
on a large but well-indexed subset of those rows.
However, if I try to do the obvious:
UPDATE mytable SET fresh=false WHERE mycriteria;
it runs for hours, consumes all memory, and starts to swap, rendering my machine nearly unusable and forcing me to kill the process, which leaves the data unchanged.
Instead, I've written a bash script to run this update in mini-chunks of a few thousand records at a time, which still takes hours, but at least gets the job done and gives me progress information to boot. Is there a better way?
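For reference, here's a minimal sketch of the kind of chunking I mean, assuming mytable has an indexed primary key id and that mycriteria (a placeholder for my actual WHERE condition) can be re-evaluated on each pass:

-- Hypothetical chunked version: run repeatedly (e.g. from a shell loop)
-- until it reports UPDATE 0.
UPDATE mytable
SET fresh = false
WHERE id IN (
    SELECT id
    FROM mytable
    WHERE mycriteria
      AND fresh          -- skip rows already flipped, so the loop terminates
    LIMIT 5000
);

Run from psql in autocommit mode, each statement commits on its own, so memory stays bounded and I can see progress one chunk at a time.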
Upvotes: 0
Views: 536
Reputation: 324511
"It runs for hours, consumes all memory, and starts to swap, rendering my machine nearly unusable and forcing me to kill the process, which leaves the data unchanged."
Based on that description, you probably have AFTER UPDATE ... FOR EACH ROW triggers defined.
At present, PostgreSQL (true in 9.4 and prior, at least) uses an in-memory queue for triggers. It's an efficient queue, but it's still in-memory, and after a few million rows that starts to really add up.
To confirm that this is the case, attach gdb to the postgres process doing the work once it's getting big, using gdb -p the-big-postgres-process-id, e.g. gdb -p 1234 if 1234 is the pid of the postgres process that shows up as using lots of RAM in top. Or you can SELECT pg_backend_pid() before running your UPDATE.
Either way, once you've got gdb attached and you're at the (gdb) prompt, run:
(gdb) p MemoryContextStats(TopMemoryContext)
(gdb) detach
(gdb) quit
If gdb complains about missing symbols, you may have to install a debuginfo package first; see the instructions on the wiki.
This will confirm where the memory is really going.
If this does turn out to be AFTER UPDATE ... FOR EACH ROW triggers, your options are:
- Use a FOR EACH STATEMENT trigger instead; there's no way to get NEW and OLD in that case (see the sketch after this list).
- Use a BEFORE trigger; or
- Sponsor development of spill-to-disk storage for the AFTER trigger queue ;-)
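For the first option, a minimal sketch of what a statement-level trigger looks like; the function name and NOTICE are illustrative, not from your schema:

-- Hypothetical statement-level replacement for a row-level trigger.
-- No NEW/OLD rows are available at statement level (as of 9.4).
CREATE OR REPLACE FUNCTION note_bulk_update() RETURNS trigger AS $$
BEGIN
    -- Record only the fact that an update ran, not the individual rows.
    RAISE NOTICE 'bulk update on mytable at %', now();
    RETURN NULL;  -- return value is ignored for AFTER statement triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_after_update_stmt
AFTER UPDATE ON mytable
FOR EACH STATEMENT
EXECUTE PROCEDURE note_bulk_update();

Because it fires once per UPDATE statement rather than once per row, nothing gets queued per-row in memory.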
BTW, one thing to keep in mind: if you have a 100-column-wide table and you update one field, every column still has to be copied into the new row version because of MVCC. The exception is TOASTable columns that are stored out-of-line (non-trivial text fields, arrays, bytea fields, etc.); if they aren't modified they don't have to be copied. So a "trivial" update may not be as trivial as you think.
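You can see that effect, if you're curious, with the standard statistics view (counts are approximate):

-- After updating N rows, roughly N dead row versions are left behind
-- until VACUUM reclaims them, even though only one column changed.
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'mytable';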
Upvotes: 2