Reputation: 10625
I'm trying to port some old MySQL queries to PostgreSQL, but I'm having trouble with this one:
DELETE FROM logtable ORDER BY timestamp LIMIT 10;
PostgreSQL doesn't allow ordering or limits in its delete syntax, and the table doesn't have a primary key, so I can't use a subquery. Additionally, I want to preserve the behavior where the query deletes exactly the given number of records -- for example, if the table contains 30 rows but they all have the same timestamp, I still want to delete 10, although it doesn't matter which 10.
So: how do I delete a fixed number of rows with sorting in PostgreSQL?
Edit: No primary key means there's no log_id column or similar. Ah, the joys of legacy systems!
Upvotes: 174
Views: 205116
Reputation: 434585
You could try using the ctid:
demo at db<>fiddle
DELETE FROM ONLY logtable
WHERE ctid IN (
SELECT ctid
FROM logtable
ORDER BY timestamp
LIMIT 10
);
The ctid is:

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier.
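A quick way to see that volatility for yourself (the tid values are just examples):

-- note the physical location of some row
SELECT ctid FROM logtable ORDER BY timestamp LIMIT 1;  -- e.g. (0,1)

-- any update writes a new row version at a new physical location
UPDATE logtable SET timestamp = timestamp WHERE ctid = '(0,1)';

-- the same logical row now has a different ctid
SELECT ctid FROM logtable ORDER BY timestamp LIMIT 1;  -- e.g. (0,31)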
In case there are tables that inherit from logtable, use DELETE FROM ONLY to prevent this from removing up to 10 rows from each of the descendants that happens to host a row with the same ctid. The ctid is only unique per table, and by default a regular DELETE cascades down the hierarchy.
If logtable is partitioned, add tableoid to stop this from wiping up to 10 rows per partition:
DELETE FROM logtable
WHERE (tableoid, ctid) IN (
SELECT tableoid, ctid
FROM logtable
ORDER BY timestamp
LIMIT 10
);
DELETE FROM ONLY wouldn't remove anything here, because the main table of a partitioned setup is an empty relation that only redirects to specific partitions.
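If you want to confirm which partitions the rows actually came from, the same statement can expose that via RETURNING (a quick sketch):

DELETE FROM logtable
WHERE (tableoid, ctid) IN (
    SELECT tableoid, ctid
    FROM logtable
    ORDER BY timestamp
    LIMIT 10
)
RETURNING tableoid::regclass AS partition, *;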
Upvotes: 257
Reputation: 1032
Level 1: Don't even bother with a LIMIT on DELETE.
What are you worried about? That you'll lock up the table? Nope, this isn't MySQL. As long as a row that gets deleted isn't used in some other transaction, you'll not block anything else.
Level 2: OK OK, but sometimes you run potentially long-running tasks in parallel, or in cron-jobs that might overlap (e.g., the DELETE job takes longer than the scheduled period). For that, the solution is:
BEGIN ;
LOCK TABLE blah_blah IN SHARE ROW EXCLUSIVE MODE ;
DELETE FROM blah_blah WHERE some_condition ;
COMMIT ;
and call it a day. If this job takes too long, the next one will simply wait for the lock (or fail noisily, if you ask for the lock with NOWAIT), but no harm done. You won't run into deadlock issues, like you might with some of the pseudo-LIMIT solutions elsewhere.
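If you'd rather have an overlapping job fail fast instead of queueing behind the lock, request it with NOWAIT (or set a lock_timeout); same placeholder names as above:

BEGIN ;
-- errors out immediately if another job already holds a conflicting lock
LOCK TABLE blah_blah IN SHARE ROW EXCLUSIVE MODE NOWAIT ;
DELETE FROM blah_blah WHERE some_condition ;
COMMIT ;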
Level 3: OK OK, but sometimes you really really only want to delete a few rows because you're curious what might happen, or because you only want to delete the top X of something and return those values. But definitely not because you want to put it in a cron-job...:
DELETE FROM blah_blah WHERE primary_key IN (
SELECT primary_key FROM blah_blah WHERE some_condition
LIMIT 10
) RETURNING * ;
If you put this in a cron-job or automated task, please combine Level 2 and Level 3. Or the deadlock-gremlins will find you and torture you.
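Combining them is just the Level 3 statement inside the Level 2 transaction; with the same placeholder names:

BEGIN ;
LOCK TABLE blah_blah IN SHARE ROW EXCLUSIVE MODE ;
DELETE FROM blah_blah WHERE primary_key IN (
    SELECT primary_key FROM blah_blah WHERE some_condition
    LIMIT 10
) RETURNING * ;
COMMIT ;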
Level 4: Right, but you're a serious DBA / Sysadmin and you know better, right? You really really need to play "nice" and not hog system resources. For instance, if you're doing point-in-time recovery (PITR), such a big delete operation will hammer the WAL system. Or maybe (as is increasingly the case these days) you actually get charged for CPU time, and you want to keep the performance curve as smooth as possible.
We can use PL/pgSQL to create a loop that semi-automates the job for us. We'll call the procedure batch.
To use it, let's imagine you've got a table node with a primary key id and a timestamp field named timestamp. You would delete the rows like this:
WITH deleted AS (
  DELETE FROM node WHERE id IN (
    SELECT id FROM node WHERE timestamp < now() - interval $$1 days$$
    LIMIT 100
  ) RETURNING id
) SELECT count(*) FROM deleted
This will delete 100 records at a time, provided those records are older than 1 day. The count of deleted rows will be "returned" for reasons that will become important shortly. (BTW, I'm not sorting the IDs here, as some suggest you should, because it's inefficient, and locking the table in the surrounding block makes more sense.)
To invoke this batch procedure with the above query, you'll do it like this:
CALL batch( 300, 100,
  'WITH deleted AS (
    DELETE FROM node WHERE id IN (
      SELECT id FROM node WHERE timestamp < now() - interval $$1 days$$
      ORDER BY id
      LIMIT @@
    ) RETURNING id
  ) SELECT count(*) FROM deleted' );
This will run the query repeatedly in batches of 100 (second parameter) for up to 300 seconds (first parameter), or until no more records are deleted. Note the use of @@ instead of a hard-coded number! This lets you parameterize the limit so that you don't have to modify the query each time. (It will come in handy in Level 5.)
Please note: the code replaces the string " LIMIT @@" including the leading space, so make sure the LIMIT is preceded by a plain space rather than a newline or tab.
create or replace procedure batch( max_time double precision, chunk_size int, query varchar )
language plpgsql
as $$
declare
    final_query varchar;
    loop_time double precision;
    target_end timestamp;
    loop_timestamp timestamp;
    last_timestamp timestamp := now();
    result int;
    i int := 0;
begin
    -- stop looping once this wall-clock deadline is reached
    target_end := clock_timestamp() + INTERVAL '1 second' * max_time;
    -- substitute the requested chunk size for the ' LIMIT @@' placeholder
    final_query := REPLACE( query, ' LIMIT @@', ' LIMIT ' || chunk_size::varchar );
    LOOP
        i := i + 1;
        EXECUTE final_query INTO result;  -- result = rows deleted this batch
        loop_timestamp := clock_timestamp();
        loop_time := extract(epoch from ( loop_timestamp - last_timestamp ))::DOUBLE PRECISION;
        last_timestamp := loop_timestamp;
        RAISE info 'loop %: count= % loop_time= % records/s= %',
            i, result, loop_time, to_char(result / loop_time, 'FM99999999.00');
        -- quit when nothing is left to delete or the time budget is spent
        EXIT WHEN result = 0 OR loop_timestamp > target_end;
    END LOOP;
end
$$;
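One caveat worth flagging: as written, the whole loop runs inside a single transaction, so row locks and WAL only clear when the CALL finishes. Since PostgreSQL 11, a procedure may COMMIT between iterations, provided the CALL itself isn't wrapped in an outer transaction block; a hedged variation of the loop body:

-- inside the LOOP, right after the EXECUTE:
EXECUTE final_query INTO result;
COMMIT;  -- ends the current transaction and starts a new one,
         -- so each chunk of deletes is released and becomes visible immediately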
Level 5: That was cool and all, but now it's time to bring it to the next and final level: let the procedure auto-scale the limit according to performance. Instead of specifying the limit size, you specify the time you want each batch to run.
Let's call this the autobatch procedure:
create or replace procedure autobatch( max_time double precision, chunk_time double precision, query varchar )
language plpgsql
as $$
declare
    max_scaleup double precision := 2;      -- never grow the limit by more than 2x per loop
    max_scaledown double precision := 0.1;  -- never shrink it below 1/10th per loop
    next_limit int;
    loop_time double precision;
    factor double precision;
    result int;
    target_end timestamp;
    loop_timestamp timestamp;
    last_loop_timestamp timestamp := now();
    final_query varchar;
    i int := 0;
begin
    target_end := clock_timestamp() + INTERVAL '1 second' * max_time;
    next_limit := max_time;  -- seed value for the first batch size
    LOOP
        i := i + 1;
        final_query := REPLACE( query, ' LIMIT @@', ' LIMIT ' || next_limit::varchar );
        EXECUTE final_query INTO result;
        loop_timestamp := clock_timestamp();
        loop_time := extract(epoch from ( loop_timestamp - last_loop_timestamp ))::DOUBLE PRECISION;
        -- rescale the next batch so its runtime approaches chunk_time
        factor := GREATEST( max_scaledown, LEAST( max_scaleup, chunk_time / loop_time ));
        next_limit := next_limit * factor;
        last_loop_timestamp := loop_timestamp;
        RAISE info 'loop %: count = % next_limit = % loop_time = %', i, result, next_limit, loop_time;
        EXIT WHEN result = 0 OR loop_timestamp > target_end;
    END LOOP;
end
$$;
Sample output:
call autobatch( 1200, 30, 'with deleted as (delete from node where id in ( select id from node where timestamp < now() - interval $$1 days$$ LIMIT @@ ) returning id) select count(*) from deleted' );
INFO: loop 1: count = 1200 next_limit = 2400 loop_time = 3.210725
INFO: loop 2: count = 2400 next_limit = 4800 loop_time = 6.350344
INFO: loop 3: count = 4800 next_limit = 9600 loop_time = 12.373832
INFO: loop 4: count = 9600 next_limit = 12261 loop_time = 23.489029
INFO: loop 5: count = 12261 next_limit = 12338 loop_time = 29.812288
INFO: loop 6: count = 12338 next_limit = 12675 loop_time = 29.202712
INFO: loop 7: count = 12675 next_limit = 13360 loop_time = 28.461273
See how it adjusts the limit size to maximize the batch time? If it goes over, it will scale down too. Scaling up is capped at 2x, while scaling down is capped at 1/10th. That's mainly to stabilize behavior when queries result in cache hits.
Upvotes: 9
Reputation: 83
If you don't have a primary key, you can use the row-value IN syntax with a composite key:
delete from table1 where (schema,id,lac,cid) in (select schema,id,lac,cid from table1 where lac = 0 limit 1000);
This worked for me. Note that if the composite (schema,id,lac,cid) is not unique, the outer DELETE can remove more than 1000 rows, because every row matching one of the selected tuples is deleted.
Upvotes: 5
Reputation: 146
Assuming you want to delete ANY 10 records (without the ordering), you could do this:
DELETE FROM logtable AS t1
WHERE t1.ctid < (
    SELECT t2.ctid FROM logtable AS t2
    WHERE (SELECT count(*) FROM logtable t3 WHERE t3.ctid < t2.ctid) = 10
    LIMIT 1
);
For my use case, deleting 10M records, this turned out to be faster.
Upvotes: 2
Reputation: 1599
The Postgres docs recommend using an array instead of IN with a subquery. This should work much faster:
DELETE FROM logtable
WHERE id = any (array(SELECT id FROM logtable ORDER BY timestamp LIMIT 10));
This and some other tricks can be found here
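The question's table has no id column, but nothing stops you from combining this array trick with the ctid from the accepted answer; a sketch, not benchmarked:

DELETE FROM logtable
WHERE ctid = any (array(
    SELECT ctid FROM logtable ORDER BY timestamp LIMIT 10
));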
Upvotes: 77
Reputation: 54005
delete from logtable where log_id in (
select log_id from logtable order by timestamp limit 10);
Upvotes: 23
Reputation: 8821
You could write a procedure that loops, deleting individual rows; the procedure could take a parameter specifying the number of rows you want to delete (a minimal sketch follows). But that's a bit overkill compared to MySQL.
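For completeness, a sketch of what that could look like. The procedure name is made up, and it uses ctid since the table has no key; requires PostgreSQL 11+ for CALL:

create or replace procedure delete_oldest( n int )
language plpgsql
as $$
begin
    -- delete the single oldest row, n times
    for i in 1..n loop
        delete from logtable
        where ctid = (select ctid from logtable order by timestamp limit 1);
    end loop;
end
$$;

call delete_oldest(10);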
Upvotes: 0