left join causes huge increase in time for query resolution

Question

The following query helps me to calculate the average of historical values distributed on even time intervals.

EXPLAIN ANALYZE SELECT start_date as date, AVG(hcv1.value::float) as value
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv
ON (
    hv.variable_id = 3 AND 
    hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date 
ORDER BY start_date

Here the report of the query: https://explain.depesz.com/s/q29a

Now if I try to add an extra column value2 pointing to another variable_id the query time goes from 2 seconds to 150 seconds:

EXPLAIN ANALYZE SELECT start_date as date,
AVG(hv1.value::float) as value1,
AVG(hv2.value::float) as value2
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv1
ON (
    hv1.variable_id = 2 AND 
    hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
LEFT JOIN history_values hv2
ON (
    hv2.variable_id = 3 AND 
    hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date 
ORDER BY start_date

Here is the report: https://explain.depesz.com/s/V1sV

Could anybody tell me why? I was really expecting the time to be around 4 seconds, not almost 75 times more.

Also note that:

SELECT COUNT(*) FROM history_values WHERE variable_id = 2 -- ~25k records
SELECT COUNT(*) FROM history_values WHERE variable_id = 3 -- ~25k records

Evan Carroll · Accepted Answer

You're not adding an extra column, you're adding another join condition. And you don't need that extra join anyway..

Try instead, just filtering the avg()

EXPLAIN ANALYZE
SELECT start_date as date,
  AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
  AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2  
FROM generate_series(
  cast('2017-01-01' as abstime)
  , cast('2017-12-01' as abstime),
  , interval '86400 seconds'
) AS start_date
LEFT JOIN history_values hv1
ON (
  hv1.created_at >= cast('2017-01-01' as abstime) AND
  hv1.created_at <= cast('2017-12-01' as abstime) AND
  hv1.created_at >= start_date AND 
  hv1.created_at < start_date + interval '86400 seconds'
)
GROUP BY start_date 
ORDER BY start_date

As a side note, you should not ever be using abstime. That should be for internal use only. Instead, I would use

EXPLAIN ANALYZE
SELECT start_date::date AS date,
  AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
  AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
  timestamp with time zone '2017-01-01',
  timestamp with time zone '2017-12-01',
  interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
ON (
  hv1.created_at BETWEEN (
    timestamp with time zone '2017-01-01'
    AND timestamp with time zone '2017-12-01'
  ) AND
  hv1.created_at >= start_date AND 
  hv1.created_at < start_date + interval '1 day' AND
  hv1.variable_id IN (1,2)

)
GROUP BY start_date 
ORDER BY start_date

I would also think you could collapse those ranges down..

EXPLAIN ANALYZE
SELECT start_date::date AS date,
  AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
  AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
  timestamp with time zone '2017-01-01',
  timestamp with time zone '2017-12-01' - interval '1 day'
  interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
  ON hv1.created_at BETWEEN start_date AND (start_date + interval '1 day' )
  AND hv1.variable_id IN (1,2)
GROUP BY start_date 
ORDER BY start_date

In the future, please ask questions specific to PostgreSQL on http://dba.stackexchange.com. I would flag this for migration there. The admins will gladly move it.

left join causes huge increase in time for query resolution

Answers (1)

Related Questions