Reputation: 64739
Is there an especially efficient way to bulk-update a simple boolean field in Postgres for a large number of records?
I have a table containing millions of rows, and occasionally I want to set fresh=false
on a large but well-indexed subset of those rows.
However, if I try to do the obvious:
UPDATE mytable SET fresh=false WHERE mycriteria;
it runs for hours, consumes all memory, and starts to swap, rendering my machine nearly unusable and forcing me to kill the process, which leaves the data unchanged.
Instead, I've written a bash script to run this update in mini-chunks of a few thousand records at a time, which still takes hours, but at least gets the job done and gives me progress information to boot. Is there a better way?
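For reference, here's a minimal sketch of the kind of chunking I mean, assuming mytable has an indexed primary key id and that mycriteria (a placeholder for my actual WHERE condition) can be re-evaluated on each pass:

-- Hypothetical chunked version: run repeatedly (e.g. from a shell loop)
-- until it reports UPDATE 0.
UPDATE mytable
SET fresh = false
WHERE id IN (
    SELECT id
    FROM mytable
    WHERE mycriteria
      AND fresh          -- skip rows already flipped, so the loop terminates
    LIMIT 5000
);

Run from psql in autocommit mode, each statement commits on its own, so memory stays bounded and I can see progress one chunk at a time.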
Upvotes: 0
Views: 536
Reputation: 324511
"It runs for hours, consumes all memory, and starts to swap, rendering my machine nearly unusable and forcing me to kill the process, which leaves the data unchanged."
Based on that description, you probably have AFTER UPDATE ... FOR EACH ROW triggers defined.
At present, PostgreSQL (true in 9.4 and prior, at least) uses an in-memory queue for triggers. It's an efficient queue, but it's still in-memory, and after a few million rows that starts to really add up.
To confirm that this is the case, attach gdb to the postgres process doing the work once it's getting big, using gdb -p the-big-postgres-process-id, e.g. gdb -p 1234 if 1234 is the pid of the postgres process that shows up as using lots of RAM in top. Or you can SELECT pg_backend_pid() before running your UPDATE.
Either way, once you've got gdb attached and you're at the (gdb) prompt, run:
(gdb) p MemoryContextStats(TopMemoryContext)
(gdb) detach
(gdb) quit
If gdb complains about missing symbols, you may have to install a debuginfo package first; see the instructions on the wiki.
This will confirm where the memory is really going.
If this does turn out to be AFTER UPDATE ... FOR EACH ROW triggers, your options are:
- Use a FOR EACH STATEMENT trigger instead; there's no way to get NEW and OLD in that case (see the sketch after this list).
- Use a BEFORE trigger; or
- Sponsor development of spill-to-disk storage for the AFTER trigger queue ;-)
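For the first option, a minimal sketch of what a statement-level trigger looks like; the function name and NOTICE are illustrative, not from your schema:

-- Hypothetical statement-level replacement for a row-level trigger.
-- No NEW/OLD rows are available at statement level (as of 9.4).
CREATE OR REPLACE FUNCTION note_bulk_update() RETURNS trigger AS $$
BEGIN
    -- Record only the fact that an update ran, not the individual rows.
    RAISE NOTICE 'bulk update on mytable at %', now();
    RETURN NULL;  -- return value is ignored for AFTER statement triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_after_update_stmt
AFTER UPDATE ON mytable
FOR EACH STATEMENT
EXECUTE PROCEDURE note_bulk_update();

Because it fires once per UPDATE statement rather than once per row, nothing gets queued per-row in memory.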
BTW, one thing to keep in mind: if you have a 100-column-wide table and you update one field, every column still has to be copied into the new row version because of MVCC. The exception is TOASTable columns that are stored out-of-line (non-trivial text fields, arrays, bytea fields, etc.); if they aren't modified they don't have to be copied. So a "trivial" update may not be as trivial as you think.
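You can see that effect, if you're curious, with the standard statistics view (counts are approximate):

-- After updating N rows, roughly N dead row versions are left behind
-- until VACUUM reclaims them, even though only one column changed.
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'mytable';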
Upvotes: 2