Mike Koltsov

Reputation: 3

How is the `rows` estimate in `explain analyze` (PostgreSQL) computed in the case of a function scan?

I was checking execution plans for parts of a complicated query and came up with this:

postgres=# explain analyze                                                                                                                                                
select * from generate_series(
            (CURRENT_DATE)::timestamp without time zone,
            (CURRENT_DATE + '14 days'::interval),
            '1 day'::interval)
;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.024..0.036 rows=15 loops=1)
 Planning Time: 0.031 ms
 Execution Time: 0.064 ms
(3 rows)


AFAIK, PostgreSQL estimates rows based on reltuples for a given table; that's understandable.

Given that the mentioned generate_series actually generates 15 rows, where does the rows=1000 come from in the case of a function scan?

Upvotes: 0

Views: 359

Answers (4)

wildplasser

Reputation: 44230

As an obvious workaround, you could fool the planner by wrapping the query in a subquery with a LIMIT:


select * FROM (
        select * from generate_series(
            (CURRENT_DATE)::timestamp without time zone,
            (CURRENT_DATE + '14 days'::interval),
            '1 day'::interval)
        LIMIT 15
        ) xxx
    ;
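
Re-running EXPLAIN on the wrapped query (a sketch; actual cost numbers will vary by version and machine) should show a Limit node capping the estimate, even though the underlying Function Scan still assumes 1000 rows:

```sql
explain analyze
select * FROM (
        select * from generate_series(
            (CURRENT_DATE)::timestamp without time zone,
            (CURRENT_DATE + '14 days'::interval),
            '1 day'::interval)
        LIMIT 15
        ) xxx;
-- The Limit node clamps the planner's estimate to rows=15;
-- the Function Scan beneath it still shows rows=1000.
```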

Upvotes: 0

Belayer

Reputation: 14861

What's the point? This is a very clear instance of premature optimization. In this case, the result is back before the brain realizes you "pressed" run.

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

Donald Knuth, "Computer Programming as an Art", 1974.

It would appear it's at least as big a problem today as it was then.

Upvotes: 0

jjanes

Reputation: 44137

The generate_series functions which take number arguments have a "support function" which peeks at the arguments and tells the planner how many rows to expect. But the variants that deal with timestamps do not have such support functions. Instead, the planner just estimates that the function returns prorows rows, which defaults to 1000.

You could change this estimate if you want:

alter function generate_series(timestamp without time zone, timestamp without time zone, interval) 
    rows 15;

But this change will not survive pg_upgrade, nor dump/reload.

This is version specific, as support functions were only implemented in v12. Before that, even the number-taking forms always planned on 1000 rows (or whatever prorows was set to for that function).
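
On v12 or later, you can see the difference between the two forms directly (illustrative; run against a live server, and expect your cost numbers to differ):

```sql
-- Integer form: the support function inspects the arguments,
-- so the planner predicts the exact row count (rows=15).
explain select * from generate_series(1, 15);

-- Timestamp form: no support function, so the planner falls
-- back to pg_proc.prorows (default 1000), giving rows=1000.
explain select * from generate_series(
    (CURRENT_DATE)::timestamp without time zone,
    (CURRENT_DATE + '14 days'::interval),
    '1 day'::interval);
```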

Upvotes: 0

pifor

Reputation: 7882

According to the documentation:

For those interested in further details, estimation of the size of a table (before any WHERE clauses) is done in src/backend/optimizer/util/plancat.c. The generic logic for clause selectivities is in src/backend/optimizer/path/clausesel.c. The operator-specific selectivity functions are mostly found in src/backend/utils/adt/selfuncs.c.

selfuncs.c contains this function, which computes selectivity estimates for boolean function clauses:

/*
 * function_selectivity
 *
 * Returns the selectivity of a specified boolean function clause.
 * This code executes registered procedures stored in the
 * pg_proc relation, by calling the function manager.
 *
 * See clause_selectivity() for the meaning of the additional parameters.
 */
Selectivity
function_selectivity(PlannerInfo *root,
                     Oid funcid,
                     List *args,
                     Oid inputcollid,
                     bool is_join,
                     int varRelid,
                     JoinType jointype,
                     SpecialJoinInfo *sjinfo)
{

It looks like this C function reads data from the pg_proc system catalog, where we have:

postgres=# select proname, prosupport, prorows 
           from pg_proc 
           where proname like '%generate%';
           proname            |          prosupport          | prorows 
------------------------------+------------------------------+---------
 generate_subscripts          | -                            |    1000
 generate_subscripts          | -                            |    1000
 generate_series              | generate_series_int4_support |    1000
 generate_series              | generate_series_int4_support |    1000
 generate_series_int4_support | -                            |       0
 generate_series              | generate_series_int8_support |    1000
 generate_series              | generate_series_int8_support |    1000
 generate_series_int8_support | -                            |       0
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
 generate_series              | -                            |    1000
(12 rows)

It looks like the pg_proc.prorows column provides the retrieved estimate.

Upvotes: 1
