partizaans

Reputation: 305

WHERE clause is slower with value from CTE than with constant?

I want to cache a value for the duration of a query running on Postgres 12. I followed the approach of using a CTE, like below:

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars)
-- END PART 2

Running the above query leads to performance issues. I expected the total runtime to be approximately (part 1 runtime + part 2 runtime), but it takes a lot longer.

Notably, there is no performance issue when I run only the 2nd part with a manually supplied datetime_threshold.
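For illustration, the fast variant of part 2 with a hand-supplied constant looks like this (the timestamp literal is a hypothetical placeholder):

-- PART 2 with a manual constant -- no performance issue
select *
from locations
where user_id = 9087
  and datetime > '2020-06-01 00:00:00'  -- hypothetical manual threshold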

The locations table is defined as:

 id | user_id | datetime | location | distance | ...
-----------------------------------------------------

Is there any way to reduce the total runtime to something like (part1 runtime + part2 runtime)?

Upvotes: 1

Views: 1546

Answers (5)

Vladimir Nikotin

Reputation: 146

Try changing it to

with cached_vars as not materialized (

See https://dba.stackexchange.com/questions/257014/are-there-side-effects-to-postgres-12s-not-materialized-directive for an explanation.
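For reference, here is the original query with the directive applied (a sketch; NOT MATERIALIZED requires Postgres 12 or later and lets the planner inline the CTE):

with cached_vars as not materialized (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars)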

Upvotes: 0

Just add LIMIT 1 in the subquery, as in the example below.

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars limit 1)
-- END PART 2

Upvotes: 0

Erwin Brandstetter

Reputation: 657777

The explanation behind the difference you observed is this:

Postgres has column statistics and can adapt the query plan depending on the value of a provided constant for datetime_threshold. With favorable filter values, this can lead to a much more efficient query plan.

In the other case, when datetime_threshold has to be computed in another SELECT first, Postgres has to default to a generic plan. datetime_threshold could be anything.

The difference will become obvious in EXPLAIN output.
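To see it, compare the plans of the two variants (EXPLAIN ANALYZE executes the query and reports the actual plan; the constant below is a hypothetical placeholder):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > (SELECT max(datetime)
                   FROM   locations
                   WHERE  distance > 70
                   AND    user_id = 9087);  -- generic plan for the subquery result

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > '2020-06-01 00:00:00';    -- hypothetical constant: custom plan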

To make sure Postgres optimizes the second part for the actual datetime_threshold value, you can either run two separate queries (feed the result of query 1 as constant to query 2), or use dynamic SQL to force re-planning of query 2 every time in a PL/pgSQL function.
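For the two-query option, a minimal sketch in psql, using its \gset meta-command to capture the result of query 1 into a client-side variable (the variable name simply follows the output column alias):

SELECT max(datetime) AS datetime_threshold
FROM   locations
WHERE  distance > 70
AND    user_id = 9087 \gset

SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > :'datetime_threshold';  -- interpolated as a quoted literal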

For the dynamic SQL route, a PL/pgSQL example:

CREATE OR REPLACE FUNCTION foo(_user_id int, _distance int = 70)
  RETURNS SETOF locations
  LANGUAGE plpgsql AS
$func$
BEGIN
   RETURN QUERY EXECUTE   -- EXECUTE re-plans with the actual parameter values
     'SELECT *
      FROM   locations
      WHERE  user_id = $1
      AND    datetime > $2'
   USING _user_id
      , (SELECT max(datetime)   -- threshold computed first, passed in as a value
         FROM   locations
         WHERE  distance > _distance
         AND    user_id = _user_id);
END
$func$;

Call:

SELECT * FROM foo(9087);


In extreme cases, you might even use another dynamic query to calculate datetime_threshold. But I don't expect that's necessary.

As for "something useful in the docs":

[...] The important difference is that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected.


Indexes

Perfect indexes would be:

CREATE INDEX ON locations (user_id, distance DESC NULLS LAST, datetime DESC NULLS LAST); -- for query 1
CREATE INDEX ON locations (user_id, datetime);                                           -- for query 2

Fine-tuning depends on undisclosed details. A partial index might be an option.
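For instance, assuming the filter distance > 70 is stable across calls (an assumption, not given in the question), a partial index could serve query 1:

CREATE INDEX ON locations (user_id, datetime DESC NULLS LAST)
WHERE  distance > 70;  -- assumes the threshold 70 is fixed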

There may be any number of additional reasons why your query is slow. Not enough details.

Upvotes: 3

Break the query into two parts and store the result of the first part in a temp table (a temporary table in PostgreSQL is only accessible in the current database session). Then use the temp table in the second part. Hopefully it will speed up the processing time.

CREATE TEMPORARY TABLE temp_table_cached_vars (
    datetime_threshold timestamp
);

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
insert into temp_table_cached_vars
select datetime_threshold from cached_vars;
-- END PART 1

-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from temp_table_cached_vars limit 1);
-- END PART 2

Upvotes: 0

Gordon Linoff

Reputation: 1270371

If you want your query to perform well, I would suggest adding the indexes locations(user_id, distance) and locations(user_id, datetime).
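In DDL form, those suggestions would be (a sketch; index names are left to be auto-generated):

CREATE INDEX ON locations (user_id, distance);
CREATE INDEX ON locations (user_id, datetime);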

I would also phrase the query using window functions:

select l.*
from (select l.*,
             max(datetime) filter (where distance > 70) over (partition by user_id) as datetime_threshold
      from locations l
      where user_id = 9087
     ) l
where datetime > datetime_threshold;

Window functions often improve performance. However, with the right indexes, I don't know if the two versions will be substantially different.

Upvotes: 0
