Reputation: 3301
We are having an issue with a simple query on a transaction table (approx. 5m rows). I have simplified the query to demonstrate the issue to:
select *
from "Customers" c
where exists
(
select 1
from "Transactions" t
where t."CustomerId" = c."Id"
)
This completes in 4ms and analysing the query shows it uses an Index Only Scan on the transactions table. If I change the query to this:
select *
from "Customers" c
where exists
(
select 1
from "Transactions" t
where t."CustomerId" = c."Id"
) or false
It takes several minutes to complete and shows a Seq Scan.
or false is just an example, in the system there is a new requirement for a bool on a customer which indicates it should return regardless of existing transactions.
How can I make Postgresql use the index, I am from a SQL Server background and this issue doesn't seem to occur when I try this same query and data there.
Update:
Using Analyze, Buffers took 12 minutes to complete with results:
Seq Scan on "Customers" c (cost=0.00..120.23 rows=386 width=167) (actual time=8370.080..731218.657 rows=3 loops=1)
Filter: (SubPlan 1)
Rows Removed by Filter: 769
Buffers: shared hit=24614 read=115961604
SubPlan 1
-> Seq Scan on "Transactions" t (cost=0.00..210163.44 rows=1582732 width=0) (actual time=947.166..947.166 rows=0 loops=772)
Filter: ("CustomerId" = c."Id")
Rows Removed by Filter: 4730230
Buffers: shared hit=24613 read=115961595 |
Planning Time: 0.132 ms
Execution Time: 731218.756 ms
There are only 3 customers with transaction data:
18 - 36321
23 - 4190090
35 - 521784
If I remove "or false" it completes instantly with Analyze, Buffers with results:
Nested Loop Semi Join (cost=0.43..365.67 rows=3 width=167) (actual time=1.891..3.312 rows=3 loops=1)
Buffers: shared hit=2317 read=12
-> Seq Scan on "Customers" c (cost=0.00..17.72 rows=772 width=167) (actual time=0.008..0.097 rows=772 loops=1)
Buffers: shared hit=10
-> Index Only Scan using "IX_Transactions_CustomerId" on "Transactions" t (cost=0.43..28599.01 rows=1582732 width=4) (actual time=0.004..0.004 rows=0 loops=772)
Index Cond: ("CustomerId" = c."Id")
Heap Fetches: 3
Buffers: shared hit=2307 read=12
Planning Time: 2.436 ms
Execution Time: 3.336 ms
Update 2:
If change to use a CTE like below it completes in 4ms, seems more complicated but I will go with this solution but I would very much like to understand why "or false" is causing a full table scan on transactions rather than the index only scan.
with cte_cust as
(
select c."Id"
from "Customers" c
where exists
(
select 1
from "Transactions" t
where t."CustomerId" = c."Id"
) --or false
)
select *
from "Customers" c
left join cte_cust on cte_cust."Id" = c."Id"
where cte_cust is not null or false
Upvotes: 1
Views: 632
Reputation: 247535
It would be best to write the query as a UNION
of two queries:
SELECT *
FROM "Customers" c
WHERE EXISTS (SELECT 1
FROM "Transactions" t
WHERE t."CustomerId" = c."Id")
UNION
SELECT *
FROM "Customers" c
WHERE /* your other condition */;
As long as "Customers"
has a primary key, that will return the same result. Even though the query looks simpler, both parts can be processed efficiently, and you will be faster.
To answer why PostgreSQL behaves that way:
If there was something more complicated than FALSE
in the OR
branch, how would you want to convert that into a semi-join?
PostgreSQL could certainly think harder and remove the OR FALSE
earlier on, but thinking harder costs CPU time during planning. Since PostgreSQL doesn't cache execution plans, that is particularly sensitive. Typically, PostgreSQL is reluctant to add such extra processing to handle “silly” special cases like this – the advice is to write the query better.
Upvotes: 1