Whatsit

Reputation: 10625

How do I delete a fixed number of rows with sorting in PostgreSQL?

I'm trying to port some old MySQL queries to PostgreSQL, but I'm having trouble with this one:

DELETE FROM logtable ORDER BY timestamp LIMIT 10;

PostgreSQL doesn't allow ordering or limits in its DELETE syntax, and the table doesn't have a primary key, so I can't use a subquery. Additionally, I want to preserve the behavior where the query deletes exactly the given number of records -- for example, if the table contains 30 rows but they all have the same timestamp, I still want to delete 10, although it doesn't matter which 10.

So: how do I delete a fixed number of rows with sorting in PostgreSQL?

Edit: No primary key means there's no log_id column or similar. Ah, the joys of legacy systems!

Upvotes: 174

Views: 205116

Answers (7)

mu is too short

Reputation: 434585

You could try using the ctid:
demo at db<>fiddle

DELETE FROM ONLY logtable
WHERE ctid IN (
    SELECT ctid
    FROM logtable
    ORDER BY timestamp
    LIMIT 10
);

The ctid is:

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier.

In case there are tables that inherit from logtable, use DELETE FROM ONLY to prevent the statement from removing up to 10 rows from each descendant that happens to host a row with the same ctid: the ctid is only unique per table, and by default a plain DELETE cascades down the inheritance hierarchy.

If logtable is partitioned, add tableoid to stop this from wiping up to 10 rows per partition:

DELETE FROM logtable
WHERE (tableoid,ctid) IN (
    SELECT tableoid,ctid
    FROM logtable
    ORDER BY timestamp
    LIMIT 10
);

A plain DELETE FROM ONLY wouldn't remove anything in that case, because the partitioned parent is an empty relation that only routes to the specific partitions.
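A quick sketch of the difference (hypothetical, illustration-only table names):

CREATE TABLE logtable_part (ts timestamptz, msg text) PARTITION BY RANGE (ts);
CREATE TABLE logtable_2024 PARTITION OF logtable_part
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
INSERT INTO logtable_part VALUES ('2024-06-01', 'hello');  -- row lands in logtable_2024

DELETE FROM ONLY logtable_part;  -- DELETE 0: the parent holds no rows itself
DELETE FROM logtable_part;       -- DELETE 1: recurses into the partitions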

Upvotes: 257

Otheus

Reputation: 1032

Level 1

Don't even bother with a LIMIT with DELETE.

What are you worried about? That you'll lock up the table? Nope, this isn't MySQL. As long as a row being deleted isn't in use by some other transaction, you won't block anything else.
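That is, Level 1 is just the plain, unbounded statement (blah_blah and some_condition are placeholders reused in the levels below):

DELETE FROM blah_blah WHERE some_condition ;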

Level 2

OK OK, but sometimes you run potentially long-running tasks in parallel, or cron-jobs that might overlap (e.g., the DELETE job takes longer than its scheduled period). For that, the solution is:

BEGIN ;
  LOCK TABLE blah_blah IN SHARE ROW EXCLUSIVE MODE ;
  DELETE FROM blah_blah WHERE some_condition ;
COMMIT ;

and call it a day. If this job takes too long, the next job will simply block and wait its turn (add NOWAIT to the LOCK TABLE if you'd rather it fail noisily instead), but no harm done. You won't run into deadlock issues, like you might with some of the pseudo-LIMIT solutions elsewhere.
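For reference, a sketch of the fail-fast variant: NOWAIT makes LOCK TABLE raise an error immediately instead of queuing behind the running job.

BEGIN ;
  LOCK TABLE blah_blah IN SHARE ROW EXCLUSIVE MODE NOWAIT ;
  DELETE FROM blah_blah WHERE some_condition ;
COMMIT ;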

Level 3

OK OK, but sometimes you really really only want to delete a few rows because you're curious what might happen, or because you only want to delete the top X of something and return those values. But definitely not because you want to put it in a cron-job...:

  DELETE FROM blah_blah WHERE primary_key IN ( 
    SELECT primary_key FROM blah_blah WHERE some_condition
    LIMIT 10 
  ) RETURNING * ;

If you put this in a cron-job or automated task, please combine Level 2 && Level 3 (sketch below). Or deadlock-gremlins will find you and torture you.
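A minimal sketch of that combination, reusing the placeholder names from above:

BEGIN ;
  LOCK TABLE blah_blah IN SHARE ROW EXCLUSIVE MODE ;
  DELETE FROM blah_blah WHERE primary_key IN (
    SELECT primary_key FROM blah_blah WHERE some_condition
    LIMIT 10
  ) RETURNING * ;
COMMIT ;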

Level 4

Right, but you're a serious DBA / Sysadmin and you know better, right? You really really need to play "nice" and not hog system resources. For instance, if you're doing PITR (point-in-time recovery), a big delete operation will hammer the WAL system. Or maybe (as is increasingly the case these days) you actually get charged for CPU time, and you want to keep the performance curve as smooth as possible.

We can use PL/pgSQL to create a loop that semi-automates the job for us. We'll call the procedure batch.

To use it, let's imagine you've got a table node with a primary key id and a timestamp field named timestamp. You would delete the rows like this:

  WITH deleted AS ( 
    DELETE FROM node WHERE id IN ( 
      SELECT id FROM node WHERE timestamp < now() - interval $$1 days$$ 
      LIMIT 100 
    ) RETURNING id 
   ) SELECT count(*) FROM deleted

This will delete 100 records at a time, provided those records are older than 1 day. The deleted count is "returned" for reasons that will become important shortly. (BTW, I'm not sorting the IDs here, as some suggest you should, because it's inefficient; locking the table in the surrounding block makes more sense.)

To invoke this batch procedure with the above query, you do it like this:

CALL batch( 300, 100, 
  'WITH deleted AS ( 
    DELETE FROM node WHERE id IN ( 
      SELECT id FROM node WHERE timestamp < now() - interval $$1 days$$ 
      LIMIT @@ 
    ) RETURNING id 
   ) SELECT count(*) FROM deleted' );

This will run the query repeatedly in batches of 100 (second parameter) for up to 300 seconds (first parameter), or until no more records are deleted. Note the use of @@ instead of the number! This allows you to parameterize the limit so that you don't have to modify the query each time. (This will come in handy in Level 5).

(Please note: the code replaces the string " LIMIT @@" including the leading space, so make sure the character before LIMIT is a plain space, not a newline or tab.)

create or replace procedure batch( max_time double precision, chunk_size int, query varchar )
language plpgsql
as $$
declare 
  final_query     varchar;
  loop_time       double precision;
  target_end      timestamp;
  loop_timestamp  timestamp;
  last_timestamp  timestamp := now();
  result          int;
  i               int := 0;
begin
  -- deadline after which no new batch is started
  target_end := clock_timestamp() + INTERVAL '1 second' * max_time;
  -- substitute the placeholder with the concrete batch size
  final_query := REPLACE( query, ' LIMIT @@', ' LIMIT ' || chunk_size::varchar);
  LOOP
    i := i + 1;
    EXECUTE final_query INTO result;  -- result = rows deleted in this batch
    loop_timestamp := clock_timestamp();
    loop_time := extract(epoch from ( loop_timestamp - last_timestamp ))::DOUBLE PRECISION;
    last_timestamp := loop_timestamp;
    RAISE info 'loop %: count= %  loop_time= %  records/s= %', i, result, loop_time, to_char(result / loop_time, 'FM99999999.00');
    -- stop when nothing was deleted or the time budget is spent
    EXIT WHEN result = 0 OR loop_timestamp > target_end ;
  END LOOP;
end
$$;

Level 5

That was cool and all, but now it's time to take it to the next and final level: let the procedure auto-scale the limit according to performance. Instead of specifying the limit size, specify the time you want each batch to run.

Let's call this the autobatch procedure:

create or replace procedure autobatch( max_time double precision, chunk_time double precision, query varchar )
language plpgsql
as $$
declare 
  max_scaleup   double precision := 2;    -- never more than double the batch size
  max_scaledown double precision := 0.1;  -- never shrink below a tenth of it
  next_limit    int;
  loop_time     double precision;
  factor        double precision;
  result        int;
  target_end    timestamp;
  loop_timestamp  timestamp;
  last_loop_timestamp timestamp := now();
  final_query varchar;
  i int := 0;
begin
  target_end := clock_timestamp() + INTERVAL '1 second' * max_time;
  next_limit := max_time ;  -- seed the first batch size from max_time
  LOOP
    i := i + 1;

    -- substitute the placeholder with the current batch size
    final_query := REPLACE( query, ' LIMIT @@', ' LIMIT ' || next_limit::varchar);

    EXECUTE final_query INTO result;  -- result = rows deleted in this batch

    loop_timestamp := clock_timestamp();
    loop_time := extract(epoch from ( loop_timestamp - last_loop_timestamp ))::DOUBLE PRECISION;
    -- scale the batch size toward the target chunk_time, clamped to [0.1, 2.0]
    factor := GREATEST( max_scaledown, LEAST( max_scaleup, chunk_time / loop_time ));
    next_limit := next_limit * factor;
    last_loop_timestamp := loop_timestamp;
    RAISE info 'loop %: count = %  next_limit = %  loop_time = %', i, result, next_limit, loop_time;
    EXIT WHEN result = 0 OR loop_timestamp > target_end ;
  END LOOP;
end
$$;

Sample output:

call autobatch( 1200, 30, 'with deleted as (delete from node where id in ( select id from node where timestamp < now() - interval $$1 days$$ LIMIT @@ ) returning id) select count(*) from deleted' );
INFO:  loop 1: count = 1200  next_limit = 2400  loop_time = 3.210725
INFO:  loop 2: count = 2400  next_limit = 4800  loop_time = 6.350344
INFO:  loop 3: count = 4800  next_limit = 9600  loop_time = 12.373832
INFO:  loop 4: count = 9600  next_limit = 12261  loop_time = 23.489029
INFO:  loop 5: count = 12261  next_limit = 12338  loop_time = 29.812288
INFO:  loop 6: count = 12338  next_limit = 12675  loop_time = 29.202712
INFO:  loop 7: count = 12675  next_limit = 13360  loop_time = 28.461273

See how it adjusts the limit size to fill the target batch time? If a batch runs over, it scales down too. Scaling up is capped at 2x, while scaling down is capped at 1/10th, mainly to stabilize behavior when queries suddenly start hitting the cache.
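You can check the arithmetic against the sample output: at loop 4 the batch of 9600 took 23.489029 s against a 30 s target, so

SELECT GREATEST(0.1, LEAST(2.0, 30 / 23.489029));  -- ≈ 1.277, and 9600 * 1.277 ≈ 12261

which matches the next_limit printed at the end of loop 4.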

Upvotes: 9

user2449151

Reputation: 83

If you don't have a primary key, you can use the multi-column WHERE ... IN syntax with a composite of columns:

delete from table1
where (schema, id, lac, cid) in (
    select schema, id, lac, cid from table1 where lac = 0 limit 1000
);

This worked for me. (Note that the column combination needs to uniquely identify rows; otherwise the IN match can delete more rows than the subquery's LIMIT.)

Upvotes: 5

Patrick Hüsler

Reputation: 146

Assuming you want to delete ANY 10 records (without the ordering) you could do this:

DELETE FROM logtable AS t1
WHERE t1.ctid < (
    SELECT t2.ctid FROM logtable AS t2
    WHERE (SELECT count(*) FROM logtable t3 WHERE t3.ctid < t2.ctid) = 10
    LIMIT 1
);

For my use case, deleting 10M records, this turned out to be faster.

Upvotes: 2

criticus

Reputation: 1599

The Postgres docs recommend using an array instead of IN with a subquery. This should work much faster:

DELETE FROM logtable 
WHERE id = any (array(SELECT id FROM logtable ORDER BY timestamp LIMIT 10));

This and some other tricks can be found here

Upvotes: 77

Konrad Garus

Reputation: 54005

delete from logtable where log_id in (
    select log_id from logtable order by timestamp limit 10);

Upvotes: 23

Bernhard

Reputation: 8821

You could write a procedure that loops over single-row deletes; the procedure could take a parameter specifying the number of rows to delete (see the sketch below). But that's a bit of overkill compared to MySQL.
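A minimal sketch of such a procedure, assuming PostgreSQL 11+ for CALL; the name delete_oldest is invented, and it leans on ctid since the table has no primary key:

create or replace procedure delete_oldest( n int )
language plpgsql
as $$
begin
  for i in 1..n loop
    -- remove the single oldest row per iteration
    delete from logtable
    where ctid = (select ctid from logtable order by timestamp limit 1);
    exit when not found;  -- stop early if the table runs out of rows
  end loop;
end
$$;

CALL delete_oldest(10);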

Upvotes: 0
