Reputation: 915

How can I optimize this join on timestamps in PostgreSQL

PostgreSQL version 10
Windows 10
16GB RAM
SSD

I'm ashamed to admit that, despite searching the hundred years of PG support archives, I cannot figure out this most basic problem. But here it is...
I have big_table with 45 million rows and little_table with 12,000 rows. I need to do a left join to include all big_table rows, along with the id's of little_table rows where big_table's timestamp overlaps with two timestamps in little_table.
This doesn't seem like it should be an extreme operation for PG, but it is taking 2 1/2 hours!
Any ideas on what I can do here? Or do you think I have unwittingly come up against the limitations of my software/hardware combo given the table size?

Thanks!

little_table with 12,000 rows

CREATE TABLE public.little_table
(
    id bigint,
    start_time timestamp without time zone,
    stop_time timestamp without time zone
);


CREATE INDEX idx_little_table
    ON public.little_table USING btree
    (start_time, stop_time DESC);

big_table with 45 million rows

CREATE TABLE public.big_table
(
    id bigint,
    datetime timestamp without time zone
) ;

CREATE INDEX idx_big_table
    ON public.big_table USING btree
    (datetime);

Query

explain analyze
select
  bt.id as bt_id,
  lt.id as lt_id
from
    big_table bt
left join
  little_table lt
on
    (bt.datetime between lt.start_time and  lt.stop_time)

Explain Results

Nested Loop Left Join  (cost=0.29..3260589190.64 rows=64945831346 width=16) (actual time=0.672..9163998.367 rows=1374445323 loops=1)
  ->  Seq Scan on big_table bt  (cost=0.00..694755.92 rows=45097792 width=16) (actual time=0.014..10085.746 rows=45098790 loops=1)
  ->  Index Scan using idx_little_table on little_table lt  (cost=0.29..57.89 rows=1440 width=24) (actual time=0.188..0.199 rows=30 loops=45098790)
        Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
Planning time: 0.165 ms
Execution time: 9199473.052 ms

NOTE: My actual query criteria is a bit more complex, but this seems to be the root of the problem. If I can fix this part, I think I can fix the rest.

Upvotes: 2

Answers (3)

jjanes

Reputation: 44227

Generating 1.3 billion rows seems pretty extreme to me. How often do you need to do this, and how fast do you need it to be?

To explain a bit about your current plan:

Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))

While it is not obvious from what is displayed above, this always scans about half the index. It starts at the beginning of the index, and stops once start_time > bt.datetime, using bt.datetime <= stop_time as an in-index filter that need to examine each row before rejecting it.

To flesh out Bergi's answer, you could do this:

alter table little_table add range tsrange;
update little_table set range =tsrange(start_time,stop_time,'[]');
create index on little_table using gist(range);

select
  bt.id as bt_id,
  lt.id as lt_id
from
    big_table bt
left join
  little_table lt
on
    (bt.datetime <@ lt.range)

In my hands, that is about 4 times faster than your current method.

If your join did not need to do a left join, then you could get some more efficient operations by joining the tables in the opposite order. Perhaps you could get better performance by separating this into 2 operations, and inner join and then a probe for missing values, and combining the results.

Upvotes: 1

Bergi

Reputation: 664620

I would suggest trying to change the start_time and end_time columns in the little table to a single tsrange column. According to the docs, this datatype supports a GIST index which can speed up the "range contains element" operator @>. Maybe this will do better than the index scan on your current btree.

Upvotes: 2

Laurenz Albe

Reputation: 246798

This query cannot perform any faster.

Since there is no equality operator (=) in your join condition, the only strategy left to PostgreSQL is a nested loop join. 45 million repetitions of an index scan on the small table just take a while.

Upvotes: 2

How can I optimize this join on timestamps in PostgreSQL

little_table with 12,000 rows

big_table with 45 million rows

Query

Explain Results

Answers (3)

Related Questions