alex88

Reputation: 4918

Improve join speed on timestamp range in PostgreSQL

I have a query that joins two tables:

On table 1 I have an index on the timestamp column, and on table 2 I have an index on (start, end), in that order. Everything works fine and is fast when I simply join them like FROM d JOIN o ON d.timestamp > o.start_ts AND d.timestamp <= o.end_ts.

The problem is that when an order is not completed, end_ts is NULL, so I need to change my join to d.timestamp <= COALESCE(o.end_ts, now()). From the explain I can see that the index scan now includes a filter step that makes it much slower (from 1s to 14s).

The query I'm currently using is:

SELECT *
FROM data pd
JOIN orders o ON pd.timestamp > o.start_time AND pd.timestamp <= COALESCE(o.end_time, now())
WHERE pd.machine_id = 19 AND pd.name = 'somevalue' AND pd.timestamp > '2019-09-15 22:00:00' AND pd.timestamp <= '2019-09-22 21:59:59.999'

the explain is

Nested Loop  (cost=0.72..162190.37 rows=1647921 width=187) (actual time=2.589..18590.362 rows=7775 loops=1)
  Buffers: shared hit=44030111 read=3348
  I/O Timings: read=20.984
  ->  Index Scan using data_timestamp_machine_id_name_unique on data pd  (cost=0.43..20237.60 rows=1730 width=81) (actual time=0.055..50.357 rows=7713 loops=1)
        Index Cond: (("timestamp" > '2019-09-15 22:00:00-07'::timestamp with time zone) AND ("timestamp" <= '2019-09-22 21:59:59.999-07'::timestamp with time zone) AND (machine_id = 19) AND ((name)::text = 'somevalue'::text))
        Buffers: shared hit=5132 read=3216
        I/O Timings: read=20.591
  ->  Index Scan using orders_machine_id_idx on orders o  (cost=0.29..72.52 rows=953 width=106) (actual time=2.401..2.401 rows=1 loops=7713)
        Index Cond: ((machine_id = 19) AND (pd."timestamp" > start_time))
        Filter: (pd."timestamp" <= COALESCE(end_time, now()))
        Rows Removed by Filter: 7108
        Buffers: shared hit=44024979 read=132
        I/O Timings: read=0.393
Planning Time: 0.191 ms
Execution Time: 18591.568 ms

Upvotes: 0

Views: 657

Answers (3)

wildplasser

Reputation: 44250

You can avoid the special handling of NULLs by using NOT(negated_condition). (But the infinity solution is cleaner, IMO.)


SELECT *
FROM data pd
JOIN orders o ON pd.timestamp > o.start_time 
        AND NOT (pd.timestamp > o.end_time)
WHERE pd.machine_id = 19 
        AND pd.name = 'somevalue'
        AND pd.timestamp > '2019-09-15 22:00:00'
        AND pd.timestamp <= '2019-09-22 21:59:59.999'
        ;

Upvotes: 1

user330315

Reputation:

This is one of the cases where range types come in handy, as they treat NULL values properly: a NULL upper bound simply means the range is unbounded at that end.

Create an index on the range:

CREATE INDEX orders_range ON orders USING gist (tsrange(start_time, end_time, '(]'));

Then use the same range definition in the join condition:

SELECT *
FROM data pd
JOIN orders o ON pd.timestamp <@ tsrange(o.start_time, o.end_time, '(]')
WHERE pd.machine_id = 19 
  AND pd.name = 'somevalue' 
  AND pd.timestamp > '2019-09-15 22:00:00' AND pd.timestamp <= '2019-09-22 21:59:59.999'

Unrelated, but: I would use pd.timestamp < '2019-09-22 22:00:00' instead of a <= operator with a timestamp that ends a few milliseconds before 22:00:00.
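For illustration, this is what the query looks like with both suggestions applied; a sketch reusing the tables and literals from the question, with only the upper bound changed to a half-open comparison:

SELECT *
FROM data pd
JOIN orders o ON pd.timestamp <@ tsrange(o.start_time, o.end_time, '(]')
WHERE pd.machine_id = 19
  AND pd.name = 'somevalue'
  AND pd.timestamp > '2019-09-15 22:00:00'
  -- half-open upper bound instead of <= '2019-09-22 21:59:59.999'
  AND pd.timestamp < '2019-09-22 22:00:00';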

Upvotes: 2

alex88

Reputation: 4918

Thanks to incognito on IRC I've found a viable solution: create an index like this:

CREATE INDEX orders_timestamps ON orders (start_time, (COALESCE(end_time, 'infinity')));

and change the query to pd.timestamp <= COALESCE(o.end_time, 'infinity'). This way I still have my nullable column and an indexed expression I can use, since I couldn't index NOW() because it isn't immutable.
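For reference, the complete query then reads as follows; a sketch using the same tables and filters as in the question (note that the COALESCE expression must match the index definition exactly, or the planner cannot use the expression index):

SELECT *
FROM data pd
JOIN orders o ON pd.timestamp > o.start_time
        -- same expression as in the index, so it can serve as an index condition
        AND pd.timestamp <= COALESCE(o.end_time, 'infinity')
WHERE pd.machine_id = 19
  AND pd.name = 'somevalue'
  AND pd.timestamp > '2019-09-15 22:00:00'
  AND pd.timestamp <= '2019-09-22 21:59:59.999';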

Upvotes: 0
