Reputation: 476
Let's say I have two tables table1
and table2
, with three columns each, id
, time
, value
. They store the same kind of information, ie. a 30-minutes timeseries data for several ids (let's imagine a machine that produces an amount of energy per day). table2
contains more precise information than table1
, but not for all timestamp nor all ids.
I want to get the best out of the two tables over the period defined by table2
, ie. storing data from table2
when available, and discarding to table1
when required (to add some more complexity, let's say that table1
is not a real table, but rather a view that takes a hell lot of time to be fully computed, so that I want to avoid compute it in its integrality).
I thought I could define a perimeter of id-time to indicate which value should be kept each day (teh daily scale should be equivalent to the 30-minutes timestamp, and be less resource-consuming). Thus I went for :
with perimeter_per_day_table2 as (
select distinct
id,
date_trunc('day', time) as day
from table2
),
perimeter_per_day_table1 as (
select id,
date_trunc('day', time) as day,
from table1
where day >= (select min(time) from table2)
and day <= (select max(time) from table2)
and (id, day) not in (select id, day from perimeter_per_day_table2)
)
select * from perimeter_per_day_table1
but that takes a hell lot of time. In particular, it seems like the condition where (id, day) not in (select id, day from perimeter_per_day_table2)
is very hard for Potsgresql to handle.
Any suggestion ?
Upvotes: 0
Views: 61
Reputation:
Indeed, NOT IN
isn't optimized as well as NOT EXIST
in Postgres. So an equivalent not exists ()
condition is typically faster.
However, in neither case do you need to apply a (costly) DISTINCT on the rows in the sub-query.
with perimeter_per_day_table1 as (
select t1.id,
date_trunc('day', t1.time) as day
from table1 t1
where t1.day >= (select min(time) from table2)
and t1.day <= (select max(time) from table2)
and not exists (select *
from table2 t2
where t1.id = t2.id
and t1.day = t2.day)
)
select *
from perimeter_per_day_table1;
You can even avoid querying table2 twice for the min/max, but I doubt that will make a huge difference if there is an index on the time
column:
with min_max as (
select min(time) as min_time,
max(time) as max_time
from table2
), perimeter_per_day_table1 as (
select t1.id,
date_trunc('day', t1.time) as day
from table1 t1
cross join min_max
where t1.day >= min_max.min_time
and t1.day <= min_max.max_time
and not exists (select *
from table2 t2
where t1.id = t2.id
and t1.day = t2.day)
)
select *
from perimeter_per_day_table1;
Upvotes: 1