Reputation: 3983
I'm wondering if somebody can explain why this runs so much longer using CTEs rather than temp tables... I'm basically deleting duplicate information out of a customer table (why duplicate information exists is beyond the scope of this post).
This is Postgres 9.5.
The CTE version is this:
with targets as
(
select
id,
row_number() over(partition by uuid order by created_date desc) as rn
from
customer
)
delete from
customer
where
id in
(
select
id
from
targets
where
rn > 1
);
I killed that version this morning after running for over an hour.
The temp table version is this:
create temp table
targets
as select
id,
row_number() over(partition by uuid order by created_date desc) as rn
from
customer;
delete from
customer
where
id in
(
select
id
from
targets
where
rn > 1
);
This version finishes in about 7 seconds.
Any idea what may be causing this?
Upvotes: 0
Views: 7989
Reputation: 44250
The CTE is slower because it has to be executed unaltered (via a CTE scan).
It is thus an optimisation barrier; for the optimiser, dismantling the CTE is not allowed, even if it would result in a smarter plan with the same results.
The CTE-solution can be refactored into a joined subquery, though (similar to the temp table in the question). In postgres, a joined subquery is usually faster than the EXISTS() variant, nowadays.
DELETE FROM customer del
USING ( SELECT id
, row_number() over(partition by uuid order by created_date desc)
as rn
FROM customer
) sub
WHERE sub.id = del.id
AND sub.rn > 1
;
Another way is to use a TEMP VIEW
. This is syntactically equivalent to the temp table
case, but semantically equivalent to the joined subquery form (they yield exactly the same query plan, at least in this case). This is because Postgres's optimiser dismantles the view and combines it with the main query (pull-up). You could see a view
as a kind of macro in PG.
CREATE TEMP VIEW targets
AS SELECT id
, row_number() over(partition by uuid ORDER BY created_date DESC) AS rn
FROM customer;
EXPLAIN
DELETE FROM customer
WHERE id IN ( SELECT id
FROM targets
WHERE rn > 1
);
[UPDATED: I was wrong about the CTEs need to be always-executed-to-completion, which is only the case for data-modifying CTEs]
Upvotes: 4
Reputation: 47392
Using a CTE is likely going to cause different bottlenecks than using a temporary table. I'm not familiar with how PostgreSQL implements CTE, but it is likely in memory, so if your server is memory starved and the resultset of your CTE is very large then you could run into issues there. I would monitor the server while running your query and try to find where the bottleneck is.
An alternative way to doing that delete which might be faster than both of your methods:
DELETE C
FROM
Customer C
WHERE
EXISTS (SELECT * FROM Customer C2 WHERE C2.uuid = C.uuid AND C2.created_date > C.created_date)
That won't handle situations where you have exact matches with created_date
, but that can be solved by adding the id
to the subquery as well.
Upvotes: 1