DWilches
DWilches

Reputation: 23035

Multiple deletes in Postgres with a commit after each one

I have about 30 million records to delete from a table, and deleting even 10.000 is taking 30 minutes. I'm concerned about issuing the delete command for all the 30 million records, so I'd like to do the delete in batches.

So my approach was to do a loop deleting a batch, then commit, then loop to delete the next batch. But that generates the following error:

LOCATION:  exec_stmt_raise, pl_exec.c:3216
ERROR:  0A000: cannot begin/end transactions in PL/pgSQL
HINT:  Use a BEGIN block with an EXCEPTION clause instead.

This is the code I wrote:

DO $$
BEGIN
    FOR i in 1..30000 loop
        DELETE FROM my_table
            WHERE id IN (
                SELECT id 
                    FROM my_table
                    WHERE should_delete = true
                    LIMIT 1000
            );
        RAISE NOTICE 'Done with batch %', i;
        COMMIT;
    END LOOP;
END
$$;

What is an alternative to achieving this?

Upvotes: 0

Views: 5167

Answers (1)

Kevin
Kevin

Reputation: 30161

A few things jump out at me:

  1. Transaction overhead in PostgreSQL is significant. It's possible you would see a substantial performance improvement if you just did this all in one big transaction.
  2. It sounds as if you are experimenting on a live production database. Don't do that. Set up a test instance with statistically similar (or identical) data and a similar workload, and use that. That way, if you manage to break something, you'll only hurt the test instance.
  3. Once you have a test instance via (2), you can use it to test (1) and see how long it takes. Then you won't have to guess at which approach is superior.
  4. You mention that other jobs run at fixed times. So this is not a database backing (something like) a web server exposed to the internet. It is a batch system which runs scheduled jobs at predetermined times. Since you already have a scheduling system, use it to schedule your deletes at times when the system is not in use (or when it is unlikely to be in use).
  5. If you decide that you must use multiple transactions, use something other than PL/pgSQL to do the actual looping. You could, for instance, use a shell script or another programming language like Python or Java. Anything with Postgres bindings will do.
  6. The really drastic approach is to replicate the whole database, perform the deletes on the replica, and then replace the original with the replica. The swap may require putting the database into read-only mode for a short time to avoid write inconsistencies and to allow the replica to converge. Since yours is a batch system, that might not matter. Obviously this is the most resource-intensive approach since it requires a whole extra database.

Upvotes: 1

Related Questions