Reputation: 447
===== Updated Based on Feedback =====
In response to some initial questions seeking clarification, here is a much simpler version of the problem.
WITH my_var AS (
  SELECT date '2016-01-01' AS a_date
  --, generate_series(1, 40) AS numbers
)
SELECT generate_series(1, 100000) AS numbers, my_var.a_date FROM my_var;
execution time: 411ms
CTE Scan on my_var  (cost=0.01..5.03 rows=1000 width=4)
  CTE my_var
    ->  Result  (cost=0.00..0.01 rows=1 width=0)
Now, if we uncomment the generate_series inside the CTE:
WITH my_var AS (
  SELECT date '2016-01-01' AS a_date
  , generate_series(1, 40) AS numbers
)
SELECT generate_series(1, 100000) AS numbers, my_var.a_date FROM my_var;
execution time: 16201ms
CTE Scan on my_var  (cost=5.01..5022.51 rows=1000000 width=4)
  CTE my_var
    ->  Result  (cost=0.00..5.01 rows=1000 width=0)
The point here is: if the generate_series(1, 40) is only supposed to be executed once, why does it take so long for the query to complete? In this case, I didn't even use the 'numbers' set in the main query, and it still took an extended amount of time to complete.
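A quick sanity check (a minimal sketch built on the same query as above) shows where at least part of the extra runtime goes: with generate_series(1, 40) uncommented, the CTE returns 40 rows instead of 1, and the outer generate_series(1, 100000) is expanded once per CTE row, so the query produces 40 * 100,000 = 4,000,000 result rows instead of 100,000:

WITH my_var AS (
  SELECT date '2016-01-01' AS a_date
  , generate_series(1, 40) AS numbers
)
SELECT count(*)
FROM (
  SELECT generate_series(1, 100000) AS numbers, my_var.a_date
  FROM my_var
) AS t;
-- returns 4000000: each of the 40 CTE rows is multiplied by the outer series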
===== Original Inquiry =====
I've run into an interesting performance issue with PostgreSQL 9.x using sub-queries and/or CTEs.
...And to be completely honest, I'm not too sure whether this is a "bug" or just a user's (i.e., my) misunderstanding of CTEs/sub-queries and/or the use of the generate_series function.
I've been writing some fairly long, advanced queries using CTEs. I've been using a technique where I put static values, such as dates, into a master CTE that filters through all of the subsequent queries. The idea is to make one set of changes, rather than many edits scattered through a long query, when you need to run it with different parameters.
An example of this is:
WITH dates AS (
SELECT
date '2013-01-01' AS start_date,
date_trunc('month', current_date) AS end_date
)
SELECT * FROM dates, sometable WHERE somedate BETWEEN start_date AND end_date;
execution time: ~650ms
So, my understanding was that a CTE is run only once; but after running into a performance issue, that is clearly not what is happening. For example, if I modify the CTE to include a generate_series:
WITH dates AS (
SELECT
date '2013-01-01' AS start_date,
date_trunc('month', current_date) AS end_date,
generate_series(1, 10) AS somelist
)
SELECT * FROM dates, sometable WHERE somedate BETWEEN start_date AND end_date
  AND myval IN (somelist);
execution time: ~23000ms
Due to some serious performance issues with this (thousands of times slower), I at first thought the generate_series() was being assigned to somelist as a function and then executed as a subquery for every row of sometable in the main query. So, to confirm this, I modified the query as follows:
WITH dates AS (
SELECT
date '2013-01-01' AS start_date,
date_trunc('month', current_date) AS end_date
--, generate_series(1, 10) AS somelist
)
SELECT * FROM dates, sometable WHERE somedate BETWEEN start_date AND end_date
  AND myval IN (generate_series(1, 10));
execution time: ~700ms
To my surprise, this was relatively fast (only about 10% slower). The generate_series as a sub-query is clearly not the issue.
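(As an aside, a sketch using the same hypothetical schema: the same test can be written with an uncorrelated sub-select, which is the more idiomatic form and also avoids placing a set-returning function in the WHERE clause, something newer PostgreSQL releases (10+) reject outright:)

WITH dates AS (
  SELECT
    date '2013-01-01' AS start_date,
    date_trunc('month', current_date) AS end_date
)
SELECT *
FROM dates, sometable
WHERE somedate BETWEEN start_date AND end_date
  AND myval IN (SELECT generate_series(1, 10));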
So I then went back to the original query and just added the generate_series, but never used it in the main query. Here is that query:
WITH dates AS (
SELECT
date '2013-01-01' AS start_date,
date_trunc('month', current_date) AS end_date,
generate_series(1, 10) AS somelist
)
SELECT * FROM dates, sometable WHERE somedate BETWEEN start_date AND end_date;
execution time: ~23000ms
This is clearly the smoking gun... but I have no idea why, or for that matter, what is really going on.
In summary: the use of generate_series within a CTE or sub-query is consuming a huge amount of time/resources (even when the result isn't being used). I get the same results in both Postgres v9.3 and v9.5. The table I'm running against has ~14 million rows, and the result set is only about 275K rows.
I'm clueless at this point. Does anyone have any theories? (... or is it a bug?)
Upvotes: 4
Views: 2255
Reputation: 44250
Experiment (I left out the dates, because they are just additional scalar constants):
EXPLAIN
WITH my_cte_b AS (
SELECT generate_series(1, 40) AS b_number
)
, my_cte_c AS (
SELECT generate_series(1, 1000) AS c_number
)
SELECT
my_cte_b.b_number
, my_cte_c.c_number
FROM my_cte_b
JOIN my_cte_c ON (1=1)
;
Result:
QUERY PLAN
------------------------------------------------------------------
Nested Loop (cost=5.01..10020.01 rows=1000000 width=8)
CTE my_cte_b
-> Result (cost=0.00..2.50 rows=1000 width=0)
CTE my_cte_c
-> Result (cost=0.00..2.50 rows=1000 width=0)
-> CTE Scan on my_cte_b (cost=0.00..10.00 rows=1000 width=4)
-> CTE Scan on my_cte_c (cost=0.00..10.00 rows=1000 width=4)
(7 rows)
But EXPLAIN ANALYZE gives the correct result:
-----------------------------
Nested Loop (cost=5.01..10020.01 rows=1000000 width=8) (actual time=0.029..8.953 rows=40000 loops=1)
CTE my_cte_b
-> Result (cost=0.00..2.50 rows=1000 width=0) (actual time=0.013..0.019 rows=40 loops=1)
CTE my_cte_c
-> Result (cost=0.00..2.50 rows=1000 width=0) (actual time=0.002..0.095 rows=1000 loops=1)
-> CTE Scan on my_cte_b (cost=0.00..10.00 rows=1000 width=4) (actual time=0.021..0.040 rows=40 loops=1)
-> CTE Scan on my_cte_c (cost=0.00..10.00 rows=1000 width=4) (actual time=0.000..0.104 rows=1000 loops=40)
Planning time: 0.042 ms
Execution time: 25.206 ms
(9 rows)
So the problem appears to be in the estimate, not in the execution.
As a bonus: you can hint (or: fool) the planner by putting a LIMIT xx in the CTEs:
EXPLAIN ANALYZE
WITH my_cte_b AS (
SELECT generate_series(1, 40) AS b_number
LIMIT 40
)
, my_cte_c AS (
SELECT generate_series(1, 1000) AS c_number
LIMIT 10000
)
SELECT
my_cte_b.b_number
, my_cte_c.c_number
FROM my_cte_b
JOIN my_cte_c ON (1=1)
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=2.60..408.00 rows=40000 width=8) (actual time=0.019..9.347 rows=40000 loops=1)
CTE my_cte_b
-> Limit (cost=0.00..0.10 rows=40 width=0) (actual time=0.008..0.018 rows=40 loops=1)
-> Result (cost=0.00..2.50 rows=1000 width=0) (actual time=0.006..0.013 rows=40 loops=1)
CTE my_cte_c
-> Limit (cost=0.00..2.50 rows=1000 width=0) (actual time=0.002..0.241 rows=1000 loops=1)
-> Result (cost=0.00..2.50 rows=1000 width=0) (actual time=0.002..0.134 rows=1000 loops=1)
-> CTE Scan on my_cte_b (cost=0.00..0.40 rows=40 width=4) (actual time=0.012..0.036 rows=40 loops=1)
-> CTE Scan on my_cte_c (cost=0.00..10.00 rows=1000 width=4) (actual time=0.000..0.112 rows=1000 loops=40)
Planning time: 0.096 ms
Execution time: 10.693 ms
(11 rows)
My conclusion: the planner has no statistics on the CTEs (they don't contain any references to physical tables), and just makes a default guess (1000 rows). This guess can be overruled by putting a LIMIT inside the CTEs.
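For completeness, a sketch of a related knob (my_series here is a hypothetical wrapper, not anything from the original post): for user-defined set-returning functions, the ROWS clause of CREATE FUNCTION declares an expected row count, giving the planner something better than its default guess of 1000:

CREATE FUNCTION my_series(int, int)
  RETURNS SETOF int
  LANGUAGE sql
  ROWS 40  -- tell the planner to expect ~40 rows per call
AS $$ SELECT generate_series($1, $2) $$;

A CTE such as WITH my_cte_b AS (SELECT my_series(1, 40) AS b_number) would then be planned with a 40-row estimate instead of 1000.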
BTW: since PostgreSQL 12, the restriction that a CTE is always executed exactly once (acting as an optimization fence) has been removed; a CTE can now be inlined, provided it is referenced only once and has no side effects.
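For illustration, a sketch of that newer syntax (PostgreSQL 12+), where the choice can also be made explicit per CTE:

-- Old behavior: materialize the CTE once and scan the result.
WITH my_cte AS MATERIALIZED (
  SELECT generate_series(1, 40) AS n
)
SELECT * FROM my_cte;

-- Let the planner inline the CTE into the outer query.
WITH my_cte AS NOT MATERIALIZED (
  SELECT generate_series(1, 40) AS n
)
SELECT * FROM my_cte;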
Upvotes: 2