Reputation: 75
I am using DuckDB, which should follows postgresql's syntax for the most part.
t_cons
(~200k) and t_flows
(~1.2M).t_cons
there are many t_flows
that match asset, year, rep_period
t_cons
, storing all t_flows.variable_index
that satisfy operation > 0.I managed to get the result with two queries, one for an INNER JOIN computing the operation and filtering and the second to aggregate, but it is too slow.
I am trying to join them in a single query and remove the inner join to prevent the scan, but I'm not that proficient with SQL. Any help is much appreciated.
The link for the sql and a sample of the data:
https://gist.github.com/abelsiqueira/d8124d6b32db8e20297be4440405e093
CREATE
OR REPLACE TABLE t_flows AS
SELECT
*
FROM
't_flows.csv';
CREATE
OR REPLACE TABLE t_cons AS
SELECT
*
FROM
't_cons.csv';
CREATE
OR REPLACE TABLE t_incoming_nonzero AS
SELECT
t_flows.from_asset,
t_flows.to_asset,
t_flows.year,
t_flows.rep_period,
t_flows.variable_index,
t_cons.time_block_start,
greatest(
0,
1 + least(t_flows.time_block_end, t_cons.time_block_end) - greatest(t_flows.time_block_start, t_cons.time_block_start)
) AS duration
FROM
t_flows
INNER JOIN t_cons ON t_flows.to_asset = t_cons.asset
AND t_flows.year = t_cons.year
AND t_flows.rep_period = t_cons.rep_period
WHERE
duration > 0;
CREATE
OR REPLACE TABLE t_result AS
SELECT
t_cons.asset,
t_cons.year,
t_cons.rep_period,
t_cons.time_block_start,
COALESCE(indices, []) AS indices,
COALESCE(durations, []) AS durations,
FROM
t_cons
LEFT JOIN (
SELECT
to_asset AS asset,
YEAR,
rep_period,
time_block_start,
array_agg(variable_index) AS indices,
array_agg(duration) AS durations,
FROM
t_incoming_nonzero
GROUP BY
to_asset,
YEAR,
rep_period,
time_block_start
) AS t_tmp ON t_cons.asset = t_tmp.asset
AND t_cons.year = t_tmp.year
AND t_cons.rep_period = t_tmp.rep_period
AND t_cons.time_block_start = t_tmp.time_block_start;
SELECT
*
FROM
t_result;
Output of "EXPLAIN ANALYZE" (DuckDB doesn't support the other settings for explain):
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ( ANALYZE ) CREATE OR REPLACE TABLE t_incoming_nonzero AS SELECT t_flows.from_asset, t_flows.to_asset, t_flows.year, t_flows.rep_period, t_flows.variable_index, t_cons.time_block_start, greatest( 0, 1 + least(t_flows.time_block_end, t_cons.time_block_end) - greatest(t_flows.time_block_start, t_cons.time_block_start) ) AS duration FROM t_flows INNER JOIN t_cons ON t_flows.to_asset = t_cons.asset AND t_flows.year = t_cons.year AND t_flows.rep_period = t_cons.rep_period WHERE duration > 0;
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 20.11s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ CREATE_TABLE_AS │
│ ──────────────────── │
│ 1 Rows │
│ (0.13s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ from_asset │
│ to_asset │
│ year │
│ rep_period │
│ variable_index │
│ time_block_start │
│ duration │
│ │
│ 1372400 Rows │
│ (0.02s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ #0 │
│ #1 │
│ #2 │
│ #3 │
│ #4 │
│ #5 │
│ #6 │
│ #10 │
│ #11 │
│ │
│ 1372400 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ (greatest(0, ((1 + least │
│ (time_block_end, │
│ time_block_end)) - │
│ greatest(time_block_start│
│ , time_block_start))) > 0)│
│ │
│ 1372400 Rows │
│ (78.74s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ │
│ Conditions: │
│ rep_period = rep_period ├──────────────┐
│ year = year │ │
│ to_asset = asset │ │
│ │ │
│ 5184051200 Rows │ │
│ (49.01s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ TABLE_SCAN ││ TABLE_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ t_flows ││ t_cons │
│ ││ │
│ Projections: ││ Projections: │
│ to_asset ││ asset │
│ year ││ year │
│ rep_period ││ rep_period │
│ time_block_end ││ time_block_end │
│ time_block_start ││ time_block_start │
│ from_asset ││ │
│ variable_index ││ │
│ ││ │
│ 1335900 Rows ││ 220460 Rows │
│ (0.04s) ││ (0.00s) │
└───────────────────────────┘└───────────────────────────┘
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ( ANALYZE ) CREATE OR REPLACE TABLE t_result AS SELECT t_cons.asset, t_cons.year, t_cons.rep_period, t_cons.time_block_start, COALESCE(indices, []) AS indices, COALESCE(durations, []) AS durations, FROM t_cons LEFT JOIN ( SELECT to_asset AS asset, YEAR, rep_period, time_block_start, array_agg(variable_index) AS indices, array_agg(duration) AS durations, FROM t_incoming_nonzero GROUP BY to_asset, YEAR, rep_period, time_block_start ) AS t_tmp ON t_cons.asset = t_tmp.asset AND t_cons.year = t_tmp.year AND t_cons.rep_period = t_tmp.rep_period AND t_cons.time_block_start = t_tmp.time_block_start;
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 0.0560s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ CREATE_TABLE_AS │
│ ──────────────────── │
│ 1 Rows │
│ (0.01s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ asset │
│ year │
│ rep_period │
│ time_block_start │
│ indices │
│ durations │
│ │
│ 220460 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ──────────────────── │
│ Join Type: LEFT │
│ │
│ Conditions: │
│ asset = asset │
│ year = year ├──────────────┐
│ rep_period = rep_period │ │
│ time_block_start = │ │
│ time_block_start │ │
│ │ │
│ 220460 Rows │ │
│ (0.04s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ TABLE_SCAN ││ PROJECTION │
│ ──────────────────── ││ ──────────────────── │
│ t_cons ││__internal_decompress_strin│
│ ││ g(#0) │
│ Projections: ││__internal_decompress_integ│
│ asset ││ ral_bigint(#1, 2030) │
│ year ││__internal_decompress_integ│
│ rep_period ││ ral_bigint(#2, 1) │
│ time_block_start ││__internal_decompress_integ│
│ ││ ral_bigint(#3, 1) │
│ ││ #4 │
│ ││ #5 │
│ ││ │
│ 220460 Rows ││ 220460 Rows │
│ (0.00s) ││ (0.00s) │
└───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ──────────────────── │
│ Groups: │
│ #0 │
│ #1 │
│ #2 │
│ #3 │
│ │
│ Aggregates: │
│ array_agg(#4) │
│ array_agg(#5) │
│ │
│ 220460 Rows │
│ (0.15s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ to_asset │
│ year │
│ rep_period │
│ time_block_start │
│ variable_index │
│ duration │
│ │
│ 1372400 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│__internal_compress_string_│
│ hugeint(#0) │
│__internal_compress_integra│
│ l_utinyint(#1, 2030) │
│__internal_compress_integra│
│ l_utinyint(#2, 1) │
│__internal_compress_integra│
│ l_usmallint(#3, 1) │
│ #4 │
│ #5 │
│ │
│ 1372400 Rows │
│ (0.01s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ TABLE_SCAN │
│ ──────────────────── │
│ t_incoming_nonzero │
│ │
│ Projections: │
│ to_asset │
│ year │
│ rep_period │
│ time_block_start │
│ variable_index │
│ duration │
│ │
│ 1372400 Rows │
│ (0.01s) │
└───────────────────────────┘
Example input and expected output:
CREATE
OR REPLACE TABLE t_flows (
from_asset text,
to_asset text,
YEAR int,
rep_period int,
time_block_start int,
time_block_end int,
variable_index int
);
--
INSERT INTO
t_flows
VALUES
('producer1', 'hub', 2030, 1, 1, 3, 1),
('producer1', 'hub', 2030, 1, 4, 6, 2),
('producer1', 'hub', 2030, 1, 7, 9, 3),
('producer1', 'hub', 2030, 1, 10, 12, 4),
('hub', 'demand', 2030, 1, 1, 4, 5),
('hub', 'demand', 2030, 1, 5, 8, 6),
('hub', 'demand', 2030, 1, 9, 12, 7);
--
CREATE
OR REPLACE TABLE t_cons (
asset text,
YEAR int,
rep_period int,
time_block_start int,
time_block_end int
);
--
INSERT INTO
t_cons
VALUES
('hub', 2030, 1, 1, 5),
('hub', 2030, 1, 6, 6),
('hub', 2030, 1, 7, 8),
('hub', 2030, 1, 9, 12);
┌─────────┬───────┬────────────┬──────────────────┬─────────┬───────────┐
│ asset │ YEAR │ rep_period │ time_block_start │ indices │ durations │
│ varchar │ int32 │ int32 │ int32 │ int32[] │ int32[] │
├─────────┼───────┼────────────┼──────────────────┼─────────┼───────────┤
│ hub │ 2030 │ 1 │ 1 │ [1, 2] │ [3, 2] │
│ hub │ 2030 │ 1 │ 6 │ [2] │ [1] │
│ hub │ 2030 │ 1 │ 7 │ [3] │ [2] │
│ hub │ 2030 │ 1 │ 9 │ [3, 4] │ [1, 3] │
└─────────┴───────┴────────────┴──────────────────┴─────────┴───────────┘
Upvotes: 1
Views: 94
Reputation: 5916
I suggest considering a more compact query.
Temporary tables are not created.
If you need assets and periods for which there is no data, try replacing INNER JOIN with LEFT JOIN.
You should create indexes to improve performance. Perhaps index on t_cons(asset, YEAR, rep_period, time_block_start) and on t_flow(to_asset,year,rep_period).
SELECT
asset,
year,
rep_period,
time_block_start,
array_agg(variable_index) AS indices,
array_agg(duration) AS durations
FROM(
SELECT
t_cons.asset,
t_flows.from_asset,
t_flows.to_asset,
t_flows.year,
t_flows.rep_period,
t_flows.variable_index,
t_cons.time_block_start,
greatest(
0,
1 + least(t_flows.time_block_end, t_cons.time_block_end) - greatest(t_flows.time_block_start, t_cons.time_block_start)
) AS duration
FROM t_flows
INNER JOIN t_cons ON t_flows.to_asset = t_cons.asset
AND t_flows.year = t_cons.year
AND t_flows.rep_period = t_cons.rep_period
)t
WHERE duration > 0
GROUP BY asset, YEAR, rep_period, time_block_start
Update1.
try use condition
least(t_flows.time_block_end, t_cons.time_block_end) - greatest(t_flows.time_block_start, t_cons.time_block_start)
in on clause.
This should significantly reduce the JOIN result.
SELECT
asset,
year,
rep_period,
time_block_start,
array_agg(variable_index) AS indices,
array_agg(duration) AS durations
FROM(
SELECT
t_cons.asset,
t_flows.from_asset,
t_flows.to_asset,
t_flows.year,
t_flows.rep_period,
t_flows.variable_index,
t_cons.time_block_start,
greatest(
0,
1 + least(t_flows.time_block_end, t_cons.time_block_end) - greatest(t_flows.time_block_start, t_cons.time_block_start)
) AS duration
FROM t_flows
INNER JOIN t_cons ON t_flows.to_asset = t_cons.asset
AND t_flows.year = t_cons.year
AND t_flows.rep_period = t_cons.rep_period
AND least(t_flows.time_block_end, t_cons.time_block_end) >= greatest(t_flows.time_block_start, t_cons.time_block_start)
)t
WHERE duration > 0
GROUP BY asset, YEAR, rep_period, time_block_start
;
Perhaps, clause WHERE duration > 0
in this case may be removed.
Fiddle
Upvotes: 0