Reputation: 915
PostgreSQL version 10
Windows 10
16GB RAM
SSD
I'm ashamed to admit that, despite searching the hundred years of PG support archives, I cannot figure out this most basic problem. But here it is...
I have big_table with 45 million rows and little_table with 12,000 rows. I need to do a left join to include all big_table rows, along with the id's of little_table rows where big_table's timestamp overlaps with two timestamps in little_table.
This doesn't seem like it should be an extreme operation for PG, but it is taking 2 1/2 hours!
Any ideas on what I can do here? Or do you think I have unwittingly come up against the limitations of my software/hardware combo given the table size?
Thanks!
CREATE TABLE public.little_table
(
id bigint,
start_time timestamp without time zone,
stop_time timestamp without time zone
);
CREATE INDEX idx_little_table
ON public.little_table USING btree
(start_time, stop_time DESC);
CREATE TABLE public.big_table
(
id bigint,
datetime timestamp without time zone
) ;
CREATE INDEX idx_big_table
ON public.big_table USING btree
(datetime);
explain analyze
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime between lt.start_time and lt.stop_time)
Nested Loop Left Join (cost=0.29..3260589190.64 rows=64945831346 width=16) (actual time=0.672..9163998.367 rows=1374445323 loops=1)
-> Seq Scan on big_table bt (cost=0.00..694755.92 rows=45097792 width=16) (actual time=0.014..10085.746 rows=45098790 loops=1)
-> Index Scan using idx_little_table on little_table lt (cost=0.29..57.89 rows=1440 width=24) (actual time=0.188..0.199 rows=30 loops=45098790)
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
Planning time: 0.165 ms
Execution time: 9199473.052 ms
NOTE: My actual query criteria is a bit more complex, but this seems to be the root of the problem. If I can fix this part, I think I can fix the rest.
Upvotes: 2
Views: 1123
Reputation: 44227
Generating 1.3 billion rows seems pretty extreme to me. How often do you need to do this, and how fast do you need it to be?
To explain a bit about your current plan:
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
While it is not obvious from what is displayed above, this always scans about half the index. It starts at the beginning of the index, and stops once start_time > bt.datetime
, using bt.datetime <= stop_time
as an in-index filter that need to examine each row before rejecting it.
To flesh out Bergi's answer, you could do this:
alter table little_table add range tsrange;
update little_table set range =tsrange(start_time,stop_time,'[]');
create index on little_table using gist(range);
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime <@ lt.range)
In my hands, that is about 4 times faster than your current method.
If your join did not need to do a left join, then you could get some more efficient operations by joining the tables in the opposite order. Perhaps you could get better performance by separating this into 2 operations, and inner join and then a probe for missing values, and combining the results.
Upvotes: 1
Reputation: 664620
I would suggest trying to change the start_time
and end_time
columns in the
little table to a single tsrange
column. According to the docs, this datatype supports a GIST index which can speed up the "range contains element" operator @>
. Maybe this will do better than the index scan on your current btree.
Upvotes: 2
Reputation: 246798
This query cannot perform any faster.
Since there is no equality operator (=
) in your join condition, the only strategy left to PostgreSQL is a nested loop join. 45 million repetitions of an index scan on the small table just take a while.
Upvotes: 2