Reputation: 85
I got an issue with Postgresql greatly underestimating number of rows returned by a join between CTE and a table. The xfactor is about 42 and I think that this leads Postgres to choose nested loop.
As you can see in EXPLAIN ANALYZE
output, aggregating consume nearly 100% of the total time.
The most interesting thing is that I've tried to add another query level to perform the sum afterwards and it is decreasing time from 33s to 6s.
I think that nested-loop is joining table in such a bad way that aggregating the output is a pain for Postgres. Or maybe I'm missing something important ?
Tables (each one containing 50k rows) :
CREATE TABLE IF NOT EXISTS public.forme_iris
(
code_iris character varying(20) COLLATE pg_catalog."default" NOT NULL,
geometry geometry,
CONSTRAINT forme_iris_pkey PRIMARY KEY (code_iris)
)
CREATE TABLE IF NOT EXISTS public.dl_rawdata_recensement
(
iris character varying(200)
pop_0_14_ans_en_2016_princ character varying(223),
pop_15_29_ans_en_2016_princ character varying(224),
pop_30_44_ans_en_2016_princ character varying(225),
pop_45_59_ans_en_2016_princ character varying(226),
pop_60_74_ans_en_2016_princ character varying(220),
pop_75_ans_ou_plus_en_2016_princ character varying(228),
CONSTRAINT dl_rawdata_recensement_pkey PRIMARY KEY (iris)
)
Covering Index :
CREATE INDEX dl_rawdata_recensement_cvg_idx1
ON public.dl_rawdata_recensement USING btree
(iris COLLATE pg_catalog."default" ASC NULLS LAST)
INCLUDE(pop_0_14_ans_en_2016_princ, pop_15_29_ans_en_2016_princ, pop_30_44_ans_en_2016_princ, pop_45_59_ans_en_2016_princ, pop_60_74_ans_en_2016_princ, pop_75_ans_ou_plus_en_2016_princ, pop_15_ans_ou_plus_hommes_en_2016_cplt, pop_15_ans_ou_plus_femmes_en_2016_cplt, pop_0_14_ans_en_2016_princ)
TABLESPACE pg_default;
Whole query:
WITH cte
AS (
SELECT ST_SetSrid(ST_GeomFromGeoJson('{geometry}'), 4326) AS geom
)
SELECT SUM(pourcentage * (pop_0_14_ans_en_2016_princ + pop_15_29_ans_en_2016_princ + pop_30_44_ans_en_2016_princ + pop_45_59_ans_en_2016_princ + pop_60_74_ans_en_2016_princ + pop_75_ans_ou_plus_en_2016_princ)) total_population,
SUM(pourcentage * pop_0_14_ans_en_2016_princ) total_0_14,
SUM(pourcentage * pop_15_29_ans_en_2016_princ) total_15_29,
SUM(pourcentage * pop_30_44_ans_en_2016_princ) total_30_44,
SUM(pourcentage * pop_45_59_ans_en_2016_princ) total_45_59,
SUM(pourcentage * pop_60_74_ans_en_2016_princ) total_60_74,
SUM(pourcentage * pop_75_ans_ou_plus_en_2016_princ) total_75
FROM (SELECT (ST_Area(ST_Intersection(iris.geometry, cte.geom)) / ST_Area(iris.geometry)) AS pourcentage,
code_iris
FROM forme_iris iris
JOIN cte
ON ST_Intersects(iris.geometry, cte.geom)) AS a
JOIN public.dl_rawdata_recensement population
ON population.iris = a.code_iris
Subquery concerned by the nested-loop:
SELECT (ST_Area(ST_Intersection(iris.geometry, cte.geom)) / ST_Area(iris.geometry)) AS pourcentage,
code_iris
FROM forme_iris iris
JOIN cte
ON ST_Intersects(iris.geometry, cte.geom)
I've vacuumed + analyzed both table before performing the request. Is Postgres still missing some statistics ?
PostgreSQL version: PostgreSQL 11.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit
Postgis version: 2.5 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
EXPLAIN ANALYZE
: https://explain.depesz.com/s/sjON
Thanks !
Upvotes: 0
Views: 578
Reputation: 44137
It looks like bottleneck is in postgis, in the calculation of ST_Area and ST_Intersection with your presumably very large and complex literal geography value. I don't think the misestimate has anything to do with it--those calculations will be done the same way no matter how many rows it thinks there will be.
Perhaps one way of writing the query causes those calculations to be done over and over, and another way of writing doesn't. But without seeming both of those queries at the same time, it is hard to say more. And I don't think PostgreSQL's decision to re-compute versus re-use is driven by row estimates.
Upvotes: 1