pratpor

Reputation: 2104

PostgreSQL default_statistics_target not improving row estimation

I am trying to optimize queries on Postgres that sometimes take minutes on huge tables. I started looking at the query plans and noticed close to a 1000x difference between the estimated number of rows and the actual rows when running with EXPLAIN ANALYZE.

This led me to the parameter default_statistics_target, which controls the number of rows sampled by the ANALYZE command to collect the statistics used by the query planner. As a few blogs suggested, I experimented by increasing the value, setting it to 1000 and even to the maximum allowed value of 10000.
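
For reference, this is roughly what I ran each time (the table name events here is a placeholder for the real table):

    -- Raise the sampling target for this session, then refresh the stats.
    SET default_statistics_target = 1000;
    ANALYZE events;

    -- Alternatively, raise it for a single column only:
    ALTER TABLE events ALTER COLUMN event_name_id SET STATISTICS 1000;
    ANALYZE events;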

I ran ANALYZE every time to ensure the stats were updated. But surprisingly, this did not improve the row estimation at all. In fact it reduced the estimate a bit further, which is hard to understand.

I also tested reducing the value to 10, which seems to have improved the estimate a bit. So I am confused whether this parameter actually does what I thought it does, or whether there is some other way to improve row estimation. Any help would be much appreciated.

Postgres version: 9.6

Query plan: At the last index scan step, the estimate is 462 rows but the actual count is 1.9M. https://explain.depesz.com/s/GZY

After changing default_statistics_target = 1000, the rows at the index scan step were:

->  (cost=0.57..120.42 rows=114 width=32) (actual time=248.999..157947.395 rows=1930518 loops=1)

And on setting default_statistics_target = 10, the counts were:

->  (cost=0.57..2610.79 rows=2527 width=32) (actual time=390.437..62668.837 rows=1930518 loops=1)

P.S. The table under consideration has more than 100M rows.

Upvotes: 1

Views: 2426

Answers (1)

jjanes

Reputation: 44157

This looks like a correlation problem. The planner assumes that the conditions on project_id, event_name_id, and "timestamp" are independent, and so multiplies their estimated selectivities. If they are not independent, then no amount of traditional statistics is going to help. Maybe you need extended statistics.
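
In PostgreSQL 10 and later that could look something like the sketch below (CREATE STATISTICS does not exist in 9.6, so this would mean upgrading; the table and column names are guessed from your plan):

    -- Tell the planner these columns are correlated (PostgreSQL 10+).
    -- 'dependencies' records functional dependencies between the columns.
    CREATE STATISTICS events_proj_event_stats (dependencies)
        ON project_id, event_name_id FROM events;

    -- Rebuild statistics so the new statistics object is populated.
    ANALYZE events;

Note that dependency statistics only help equality conditions, so the range condition on "timestamp" would still be estimated the old way.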

Also, at the time it makes the plan, it doesn't even know what value event_name_id will be compared to, as $0 is not determined until run time, so it can't use value-specific statistics for that. You could execute the subquery manually, then hard-code the resulting value into that spot in the query, so the planner knows what the value is while it is planning.
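
For illustration, assuming the filter looks something like event_name_id = (SELECT ...), the idea is (all names here are guesses, not from your actual query):

    -- Step 1: run the subquery by itself to get the concrete value.
    SELECT id FROM event_names WHERE name = 'some_event';
    -- suppose it returns 42

    -- Step 2: substitute the literal into the main query, so the planner
    -- can use value-specific statistics for event_name_id.
    SELECT *
    FROM events
    WHERE project_id = 123
      AND event_name_id = 42   -- was: (SELECT id FROM event_names WHERE ...)
      AND "timestamp" >= '2018-01-01';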

Upvotes: 2
