Improving efficiency of a self join in postgresql

Question

I am performing the following query with a self join:

with t as (
      SELECT *, TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' as tstamp2
      FROM
      mytable 
      WHERE id = 'a'
      LIMIT 1000
    )
select v1.id as id, date_trunc('hour', v1.tstamp2) as hour, v1.value as start, v2.value as stop 
from 
    t v1 join 
    t v2 
        on v1.id = v2.id and
        date_trunc('hour', v1.tstamp2) = date_trunc('hour', v2.tstamp2) and
        v1.tstamp2 < v2.tstamp2 
where 1=1
limit 100;

The table looks like that:

id   tstamp    value    tstamp2

My goal is to output all the combination of "value" within the same hour for one id. I have 100.000 unique ids and millions of rows. This is extremely slow and inefficient. Is there a way to break the query so the self join operates on time partitions (hour by hour for example) to improve speed of such query?

I have 100.000 unique ids and millions of rows.

EDIT: I found this which seems to be what I want to do but no idea how to implement that:

If you know more than you've let on about the properties of the intervals, you might be able to improve things. For instance if the intervals fall into nonoverlapping buckets then you could add a constraint that the buckets of the two sides are equal. Postgres is a lot better with equality join constraints than it is with range constraints, so it would be able to match up rows and only do the O(N^2) work within each bucket.

Gordon Linoff · Accepted Answer

This answers the question as originally tagged -- "Postgres", not "Redshift".

Unfortunately, Postgres materializes CTEs, which then precludes the use of indexes. You have no ORDER BY in the CTE, so arbitrary rows are being chosen.

One solution is a temporary table and indexes:

CREATE TEMPORARY TABLE t as
      SELECT t.*,
             TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' as tstamp2,
             DATE_TRUNC('hour', 'epoch' + tstamp * INTERVAL '1 second') as tstamp2_hour
      FROM mytable t
      WHERE t.id = 'a'
      LIMIT 1000;

CREATE INDEX t_id_hour_tstamp2 ON t(id, tstamp2_hour, tstamp2);

select v1.id as id, v1.tstamp2_hour as hour, v1.value as start, v2.value as stop 
from t v1 join 
     t v2 
        on v1.id = v2.id and
           v1.tstamp2_hour = v2.tstamp2_hour and
           v1.tstamp2 < v2.tstamp2 
limit 100;

Improving efficiency of a self join in postgresql

Answers (1)

Related Questions