user351711
user351711

Reputation: 3301

Postgresql Slow Exists OR clause added (Using Seq Scan instead of Index)

We are having an issue with a simple query on a transaction table (approx. 5m rows). I have simplified the query to demonstrate the issue to:

select *
from "Customers" c
where exists
(
  select 1 
    from "Transactions" t 
    where t."CustomerId" = c."Id"
) 

This completes in 4ms and analysing the query shows it uses an Index Only Scan on the transactions table. If I change the query to this:

select *
from "Customers" c
where exists
(
  select 1 
    from "Transactions" t 
    where t."CustomerId" = c."Id"
) or false

It takes several minutes to complete and shows a Seq Scan.

or false is just an example, in the system there is a new requirement for a bool on a customer which indicates it should return regardless of existing transactions.

How can I make Postgresql use the index, I am from a SQL Server background and this issue doesn't seem to occur when I try this same query and data there.

Update:

Using Analyze, Buffers took 12 minutes to complete with results:

Seq Scan on "Customers" c  (cost=0.00..120.23 rows=386 width=167) (actual time=8370.080..731218.657 rows=3 loops=1)          
   Filter: (SubPlan 1)                                                           
   Rows Removed by Filter: 769                                       
   Buffers: shared hit=24614 read=115961604                          
   SubPlan 1                                                         
        ->  Seq Scan on "Transactions" t  (cost=0.00..210163.44 rows=1582732 width=0) (actual time=947.166..947.166 rows=0 loops=772)
              Filter: ("CustomerId" = c."Id")
              Rows Removed by Filter: 4730230
              Buffers: shared hit=24613 read=115961595                                                                               | 
Planning Time: 0.132 ms                                             
Execution Time: 731218.756 ms                                       

There are only 3 customers with transaction data:

18 - 36321
23 - 4190090
35 - 521784

If I remove "or false" it completes instantly with Analyze, Buffers with results:

Nested Loop Semi Join  (cost=0.43..365.67 rows=3 width=167) (actual time=1.891..3.312 rows=3 loops=1)
   Buffers: shared hit=2317 read=12                                  
   ->  Seq Scan on "Customers" c  (cost=0.00..17.72 rows=772 width=167) (actual time=0.008..0.097 rows=772 loops=1)                
        Buffers: shared hit=10
   ->  Index Only Scan using "IX_Transactions_CustomerId" on "Transactions" t  (cost=0.43..28599.01 rows=1582732 width=4) (actual time=0.004..0.004 rows=0 loops=772)
        Index Cond: ("CustomerId" = c."Id")
        Heap Fetches: 3
        Buffers: shared hit=2307 read=12
Planning Time: 2.436 ms                                                     
Execution Time: 3.336 ms                                            

Update 2:

If change to use a CTE like below it completes in 4ms, seems more complicated but I will go with this solution but I would very much like to understand why "or false" is causing a full table scan on transactions rather than the index only scan.

with cte_cust as
(
   select c."Id"
   from "Customers" c
   where exists
   (
        select 1 
        from "Transactions" t 
        where t."CustomerId" = c."Id"
   ) --or false
)
select *
from "Customers" c
left join cte_cust on cte_cust."Id" = c."Id"
where cte_cust is not null or false

Upvotes: 1

Views: 632

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 247535

It would be best to write the query as a UNION of two queries:

SELECT *
FROM "Customers" c
WHERE EXISTS (SELECT 1 
                 FROM "Transactions" t 
                 WHERE t."CustomerId" = c."Id")
UNION
SELECT *
FROM "Customers" c
WHERE /* your other condition */;

As long as "Customers" has a primary key, that will return the same result. Even though the query looks simpler, both parts can be processed efficiently, and you will be faster.

To answer why PostgreSQL behaves that way:

If there was something more complicated than FALSE in the OR branch, how would you want to convert that into a semi-join?

PostgreSQL could certainly think harder and remove the OR FALSE earlier on, but thinking harder costs CPU time during planning. Since PostgreSQL doesn't cache execution plans, that is particularly sensitive. Typically, PostgreSQL is reluctant to add such extra processing to handle “silly” special cases like this – the advice is to write the query better.

Upvotes: 1

Related Questions