Reputation: 447
===== Updated Based on Feedback =====
In response to some initial questions seeking clarification, here is a much simpler version of the problem.
WITH my_var AS (
  SELECT date '2016-01-01' AS a_date
  --, generate_series(1, 40) AS numbers
)
SELECT generate_series(1, 100000) AS numbers, my_var.a_date FROM my_var;
execution time: 411ms
CTE Scan on my_var  (cost=0.01..5.03 rows=1000 width=4)
  CTE my_var
    ->  Result  (cost=0.00..0.01 rows=1 width=0)
Now, if we uncomment the generate_series inside the CTE:
WITH my_var AS (
  SELECT date '2016-01-01' AS a_date
  , generate_series(1, 40) AS numbers
)
SELECT generate_series(1, 100000) AS numbers, my_var.a_date FROM my_var;
execution time: 16201ms
CTE Scan on my_var  (cost=5.01..5022.51 rows=1000000 width=4)
  CTE my_var
    ->  Result  (cost=0.00..5.01 rows=1000 width=0)
The point here is: if the generate_series(1, 40) is only supposed to be executed once, why does it take so long for the query to complete? In this case, I didn't even use the 'numbers' set in the main query, and it still took an extended amount of time to complete.
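A quick sanity check (a minimal sketch built on the same query as above) shows where at least part of the extra runtime goes: with generate_series(1, 40) uncommented, the CTE returns 40 rows instead of 1, and the outer generate_series(1, 100000) is expanded once per CTE row, so the query produces 40 * 100,000 = 4,000,000 result rows instead of 100,000:

WITH my_var AS (
  SELECT date '2016-01-01' AS a_date
  , generate_series(1, 40) AS numbers
)
SELECT count(*)
FROM (
  SELECT generate_series(1, 100000) AS numbers, my_var.a_date
  FROM my_var
) AS t;
-- returns 4000000: each of the 40 CTE rows is multiplied by the outer series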
===== Original Inquiry =====
I've run into an interesting performance issue with PostgreSQL 9.x using sub-queries and/or CTEs.
...And to be completely honest, I'm not too sure whether this is a "bug" or just a user's (i.e., my) misunderstanding of CTEs/sub-queries and/or the use of the generate_series function.
I've been writing some fairly long, advanced queries using CTEs. I've been using a technique where I put static values, such as dates, into a master CTE that filters through all of the subsequent queries. The idea is to make one set of changes, rather than many edits scattered through a long query, when you need to run it with different parameters.
An example of this is:
WITH dates AS (
SELECT
date '2013-01-01' AS start_date,
date_trunc('month', current_date) AS end_date
)
SELECT * FROM dates, sometable WHERE somedate BETWEEN start_date AND end_date;
execution time: ~650ms
So, my understanding was that a CTE is run only once; but after running into a performance issue, that is clearly not what is happening. For example, if I modify the CTE to include a generate_series:
WITH dates AS (
SELECT
date '2013-01-01' AS start_date,
date_trunc('month', current_date) AS end_date,
generate_series(1, 10) AS somelist
)
SELECT * FROM dates, sometable WHERE somedate BETWEEN start_date AND end_date
  AND myval IN (somelist);
execution time: ~23000ms
Due to some serious performance issues with this (thousands of times slower), I at first thought the generate_series() was being assigned to somelist as a function and then executed as a subquery for every row of sometable in the main query. So, to confirm this, I modified the query as follows:
WITH dates AS (
SELECT
date '2013-01-01' AS start_date,
date_trunc('month', current_date) AS end_date
--, generate_series(1, 10) AS somelist
)
SELECT * FROM dates, sometable WHERE somedate BETWEEN start_date AND end_date
  AND myval IN (generate_series(1, 10));
execution time: ~700ms
To my surprise, this was relatively fast (only about 10% slower). The generate_series as a sub-query is clearly not the issue.
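(As an aside, a sketch using the same hypothetical schema: the same test can be written with an uncorrelated sub-select, which is the more idiomatic form and also avoids placing a set-returning function in the WHERE clause, something newer PostgreSQL releases (10+) reject outright:)

WITH dates AS (
  SELECT
    date '2013-01-01' AS start_date,
    date_trunc('month', current_date) AS end_date
)
SELECT *
FROM dates, sometable
WHERE somedate BETWEEN start_date AND end_date
  AND myval IN (SELECT generate_series(1, 10));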
So I then went back to the original query and just added the generate_series, but never used it in the main query. Here is that query:
WITH dates AS (
SELECT
date '2013-01-01' AS start_date,
date_trunc('month', current_date) AS end_date,
generate_series(1, 10) AS somelist
)
SELECT * FROM dates, sometable WHERE somedate BETWEEN start_date AND end_date;
execution time: ~23000ms
This is clearly the smoking gun... but I have no idea why, or for that matter, what is really going on.
In summary: the use of generate_series within a CTE or sub-query is consuming a huge amount of time/resources (even when the result isn't being used). I get the same results in both Postgres v9.3 and v9.5. The table I'm running against has ~14 million rows, and the result set is only about 275K rows.
I'm clueless at this point. Does anyone have any theories? (... or is it a bug?)
Upvotes: 4
Views: 2255
Reputation: 44250
Experiment (I left out the dates, because they are just additional scalar constants):
EXPLAIN
WITH my_cte_b AS (
SELECT generate_series(1, 40) AS b_number
)
, my_cte_c AS (
SELECT generate_series(1, 1000) AS c_number
)
SELECT
my_cte_b.b_number
, my_cte_c.c_number
FROM my_cte_b
JOIN my_cte_c ON (1=1)
;
Result:
QUERY PLAN
------------------------------------------------------------------
Nested Loop (cost=5.01..10020.01 rows=1000000 width=8)
CTE my_cte_b
-> Result (cost=0.00..2.50 rows=1000 width=0)
CTE my_cte_c
-> Result (cost=0.00..2.50 rows=1000 width=0)
-> CTE Scan on my_cte_b (cost=0.00..10.00 rows=1000 width=4)
-> CTE Scan on my_cte_c (cost=0.00..10.00 rows=1000 width=4)
(7 rows)
But EXPLAIN ANALYZE gives the correct result:
-----------------------------
Nested Loop (cost=5.01..10020.01 rows=1000000 width=8) (actual time=0.029..8.953 rows=40000 loops=1)
CTE my_cte_b
-> Result (cost=0.00..2.50 rows=1000 width=0) (actual time=0.013..0.019 rows=40 loops=1)
CTE my_cte_c
-> Result (cost=0.00..2.50 rows=1000 width=0) (actual time=0.002..0.095 rows=1000 loops=1)
-> CTE Scan on my_cte_b (cost=0.00..10.00 rows=1000 width=4) (actual time=0.021..0.040 rows=40 loops=1)
-> CTE Scan on my_cte_c (cost=0.00..10.00 rows=1000 width=4) (actual time=0.000..0.104 rows=1000 loops=40)
Planning time: 0.042 ms
Execution time: 25.206 ms
(9 rows)
So the problem appears to be in the estimate, not in the execution.
As a bonus: you can hint (or: fool) the planner by putting a LIMIT xx in the CTEs:
EXPLAIN ANALYZE
WITH my_cte_b AS (
SELECT generate_series(1, 40) AS b_number
LIMIT 40
)
, my_cte_c AS (
SELECT generate_series(1, 1000) AS c_number
LIMIT 10000
)
SELECT
my_cte_b.b_number
, my_cte_c.c_number
FROM my_cte_b
JOIN my_cte_c ON (1=1)
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=2.60..408.00 rows=40000 width=8) (actual time=0.019..9.347 rows=40000 loops=1)
CTE my_cte_b
-> Limit (cost=0.00..0.10 rows=40 width=0) (actual time=0.008..0.018 rows=40 loops=1)
-> Result (cost=0.00..2.50 rows=1000 width=0) (actual time=0.006..0.013 rows=40 loops=1)
CTE my_cte_c
-> Limit (cost=0.00..2.50 rows=1000 width=0) (actual time=0.002..0.241 rows=1000 loops=1)
-> Result (cost=0.00..2.50 rows=1000 width=0) (actual time=0.002..0.134 rows=1000 loops=1)
-> CTE Scan on my_cte_b (cost=0.00..0.40 rows=40 width=4) (actual time=0.012..0.036 rows=40 loops=1)
-> CTE Scan on my_cte_c (cost=0.00..10.00 rows=1000 width=4) (actual time=0.000..0.112 rows=1000 loops=40)
Planning time: 0.096 ms
Execution time: 10.693 ms
(11 rows)
My conclusion: the planner has no statistics on the CTEs (they don't contain any references to physical tables), and just makes a default guess (1000 rows). This guess can be overruled by putting a LIMIT inside the CTEs.
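For completeness, a sketch of a related knob (my_series here is a hypothetical wrapper, not anything from the original post): for user-defined set-returning functions, the ROWS clause of CREATE FUNCTION declares an expected row count, giving the planner something better than its default guess of 1000:

CREATE FUNCTION my_series(int, int)
  RETURNS SETOF int
  LANGUAGE sql
  ROWS 40  -- tell the planner to expect ~40 rows per call
AS $$ SELECT generate_series($1, $2) $$;

A CTE such as WITH my_cte_b AS (SELECT my_series(1, 40) AS b_number) would then be planned with a 40-row estimate instead of 1000.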
BTW: since PostgreSQL 12, the restriction that a CTE is always executed exactly once (acting as an optimization fence) has been removed; a CTE can now be inlined, provided it is referenced only once and has no side effects.
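For illustration, a sketch of that newer syntax (PostgreSQL 12+), where the choice can also be made explicit per CTE:

-- Old behavior: materialize the CTE once and scan the result.
WITH my_cte AS MATERIALIZED (
  SELECT generate_series(1, 40) AS n
)
SELECT * FROM my_cte;

-- Let the planner inline the CTE into the outer query.
WITH my_cte AS NOT MATERIALIZED (
  SELECT generate_series(1, 40) AS n
)
SELECT * FROM my_cte;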
Upvotes: 2