Reputation: 6897
We are moving to TimescaleDB and have migrated some large tables with more than 400 million rows holding versioned time series forecasts.
The structure of the table is as follows, where dt_start_utc holds the target date of the forecast and version_utc the issue date of the forecast (the newer the version, the closer it was issued to the actual forecast date):
sandbox_cord=# \d+ fc_power_raw_import_normalized
                                       Table "public.fc_power_raw_import_normalized"
        Column        |            Type             | Collation | Nullable | Default | Storage | Stats target | Description
----------------------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
 dt_start_utc         | timestamp without time zone |           | not null |         | plain   |              |
 fc_id                | integer                     |           | not null |         | plain   |              |
 fc_kwh               | integer                     |           |          |         | plain   |              |
 fc_power_supplier_id | integer                     |           | not null |         | plain   |              |
 fc_power_type_id     | integer                     |           | not null |         | plain   |              |
 version_utc          | timestamp without time zone |           | not null |         | plain   |              |
Indexes:
    "fc_power_raw_import_normalized2_pkey" PRIMARY KEY, btree (dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id, version_utc)
    "fc_power_raw_import_normalize2_dt_start_utc_fc_id_fc_power_s_id" btree (dt_start_utc DESC, fc_id, fc_power_supplier_id, fc_power_type_id, version_utc DESC)
    "fc_power_raw_import_normalize2_dt_start_utc_fc_power_supplie_id" btree (dt_start_utc DESC, fc_power_supplier_id, fc_power_type_id)
    "fc_power_raw_import_normalized2_dt_start_utc_idx" btree (dt_start_utc DESC)
    "fc_power_raw_import_normalized2_version_utc_idx" btree (version_utc DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON fc_power_raw_import_normalized2 FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_3_2334_chunk,
              _timescaledb_internal._hyper_3_2335_chunk,
              _timescaledb_internal._hyper_3_2336_chunk,
              _timescaledb_internal._hyper_3_2337_chunk,
              _timescaledb_internal._hyper_3_2338_chunk,
              _timescaledb_internal._hyper_3_2339_chunk
Access method: heap
...
And here are some values:
sandbox_cord=# SELECT * FROM fc_power_raw_import_normalized ORDER BY fc_id ASC LIMIT 25;
    dt_start_utc     | fc_id | fc_kwh | fc_power_supplier_id | fc_power_type_id |     version_utc
---------------------+-------+--------+----------------------+------------------+---------------------
 2020-08-27 00:00:00 |     9 |    167 |                    5 |                1 | 2020-08-23 00:27:03
 2020-08-27 00:00:00 |     9 |    150 |                    5 |                1 | 2020-08-23 01:12:37
 2020-08-27 00:00:00 |     9 |    132 |                    5 |                1 | 2020-08-23 07:11:42
 2020-08-27 00:00:00 |     9 |    144 |                    5 |                1 | 2020-08-23 13:12:11
 2020-08-27 00:00:00 |     9 |    161 |                    5 |                1 | 2020-08-23 19:13:05
 2020-08-27 00:00:00 |     9 |    166 |                    5 |                1 | 2020-08-24 01:11:53
...
Now I want to get the newest version of the time series per group (fc_id), which I get with the following query:
SELECT *
FROM fc_power_raw_import_normalized
WHERE (dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id, version_utc) IN (
        SELECT
            dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id, MAX(version_utc) version_utc
        FROM
            fc_power_raw_import_normalized
        WHERE
            dt_start_utc > now() - INTERVAL '2 weeks'
        GROUP BY
            dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id
    )
    AND dt_start_utc > now() - INTERVAL '2 weeks'
ORDER BY fc_id, dt_start_utc, version_utc;
UPDATE: Or with the following query using TimescaleDB's last() function:
SELECT
    dt_start_utc,
    fc_id,
    fc_power_supplier_id,
    fc_power_type_id,
    last(fc_kwh, version_utc) AS fc_kwh_last
FROM fc_power_raw_import_normalized
WHERE dt_start_utc > now() - INTERVAL '2 weeks'
GROUP BY dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id
ORDER BY dt_start_utc ASC, fc_id ASC;
Either way, the result looks like this:
    dt_start_utc     | fc_id | fc_kwh | fc_power_supplier_id | fc_power_type_id |     version_utc
---------------------+-------+--------+----------------------+------------------+---------------------
 2021-10-12 16:45:00 |    19 |     99 |                    4 |                1 | 2021-10-12 13:13:50
 2021-10-12 16:45:00 |    19 |     99 |                    4 |                2 | 2021-10-12 13:14:47
 2021-10-12 17:00:00 |    19 |    100 |                    4 |                1 | 2021-10-12 13:13:50
 2021-10-12 17:00:00 |    19 |    100 |                    4 |                2 | 2021-10-12 13:14:47
 2021-10-12 17:15:00 |    19 |    103 |                    4 |                1 | 2021-10-12 13:13:50
 2021-10-12 17:15:00 |    19 |    103 |                    4 |                2 | 2021-10-12 13:14:47
 2021-10-12 17:30:00 |    19 |    105 |                    4 |                1 | 2021-10-12 13:13:50
 2021-10-12 17:30:00 |    19 |    105 |                    4 |                2 | 2021-10-12 13:14:47
 2021-10-12 17:45:00 |    19 |    108 |                    4 |                1 | 2021-10-12 13:13:50
 2021-10-12 17:45:00 |    19 |    108 |                    4 |                2 | 2021-10-12 13:14:47
 2021-10-12 18:00:00 |    19 |    108 |                    4 |                1 | 2021-10-12 13:13:50
 2021-10-12 18:00:00 |    19 |    108 |                    4 |                2 | 2021-10-12 13:14:47
 2021-10-12 18:15:00 |    19 |    105 |                    4 |                1 | 2021-10-12 13:13:50
 2021-10-12 18:15:00 |    19 |    105 |                    4 |                2 | 2021-10-12 13:14:47
 2021-10-12 18:30:00 |    19 |     82 |                    4 |                1 | 2021-10-12 18:28:47
 2021-10-12 18:30:00 |    19 |     82 |                    4 |                2 | 2021-10-12 18:29:59
 2021-10-12 18:45:00 |    19 |     82 |                    4 |                1 | 2021-10-12 18:28:47
 2021-10-12 18:45:00 |    19 |     82 |                    4 |                2 | 2021-10-12 18:29:59
 2021-10-12 19:00:00 |    19 |     81 |                    4 |                1 | 2021-10-12 18:28:47
...
But this is actually pretty slow: e.g., it takes ~77 seconds to compute this for a 3-month window, and considerably longer for longer horizons.
I have tried different variants of this query using INNER JOIN or window functions, and I also added different indexes according to this article, but none of these approaches brought a significant increase in performance. A sketch of the window-function variant follows below.
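For illustration, such a window-function variant might look like this (a sketch, not the exact query I benchmarked; ROW_NUMBER() keeps only the newest version_utc per group):
-- Rank versions within each group, newest first, and keep rank 1
SELECT dt_start_utc, fc_id, fc_kwh, fc_power_supplier_id, fc_power_type_id, version_utc
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id
               ORDER BY version_utc DESC
           ) AS rn
    FROM fc_power_raw_import_normalized
    WHERE dt_start_utc > now() - INTERVAL '2 weeks'
) t
WHERE rn = 1
ORDER BY fc_id, dt_start_utc;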
Another approach was to use continuous aggregates, but they require a time bucket to be defined (see this question: Continuous aggregates in postgres/timescaledb requires time_bucket-function?), which did not seem suitable here. The underlying chunk size was another thing I changed, as sketched below, but also without significant improvement.
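The chunk interval change was along these lines (the '1 week' value is only illustrative, not the value we settled on; note that set_chunk_time_interval() only affects chunks created after the call):
-- Change the time interval covered by each new hypertable chunk
SELECT set_chunk_time_interval('fc_power_raw_import_normalized', INTERVAL '1 week');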
So far, for this table structure and query, the performance of TimescaleDB is not much better than that of our optimized "conventional" relational database (MariaDB), although I expected a dedicated time series database to deliver better performance out of the box.
Now my question is simple: How can I efficiently get the latest/newest version of versioned time series per group/aggregate function using TimescaleDB and this table structure?
Any hints are welcome!
Upvotes: 2
Views: 611
Reputation: 6897
I have now found a solution using Continuous Aggregates (with Real Time Aggregation as the default behaviour). It might have been obvious, but it nonetheless took me a while to discover, so I will share it here in order to help someone else with the same problem.
The query is slightly adapted but the example shows the general approach.
Example:
Create a continuous aggregate:
CREATE MATERIALIZED VIEW latest_forecast_15_min
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('15 minutes', dt_start_utc) AS dt_start,
    fc_id,
    fc_power_supplier_id,
    fc_power_type_id,
    last(fc_kwh, version_utc) AS fc_kwh_last
FROM fc_power_raw_import_normalized
GROUP BY dt_start, fc_id, fc_power_supplier_id, fc_power_type_id;
Note that the view definition must not contain a WHERE clause with now(), since continuous aggregates only allow immutable functions; the time range that gets materialized is bounded by the refresh policy below instead.
Add a refresh policy for the last 2 weeks that updates hourly:
SELECT add_continuous_aggregate_policy('latest_forecast_15_min',
    start_offset => INTERVAL '2 weeks',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
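If buckets older than the policy's start_offset are needed in the view as well, they can be backfilled once with a manual refresh (the date range here is purely illustrative):
-- One-off refresh over a fixed historical window
CALL refresh_continuous_aggregate('latest_forecast_15_min', '2020-01-01', '2021-10-01');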
Select data from the Continuous Aggregate:
SELECT
    dt_start,
    fc_id,
    fc_power_supplier_id,
    fc_power_type_id,
    fc_kwh_last
FROM latest_forecast_15_min;
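Thanks to real-time aggregation, buckets newer than the last materialization are computed on the fly, so the view can be filtered and ordered like any table (the 3-day window here is just an example):
-- Read only the most recent buckets from the continuous aggregate
SELECT dt_start, fc_id, fc_power_supplier_id, fc_power_type_id, fc_kwh_last
FROM latest_forecast_15_min
WHERE dt_start > now() - INTERVAL '3 days'
ORDER BY dt_start, fc_id;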
For my example, this reduced the query time from ~13 seconds to ~2 seconds, and it provides the flexibility of adding further aggregates for different query types, such as other groupings or time intervals (see the sketch below).
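For instance, a coarser hourly aggregate could be created along the same lines (the name latest_forecast_1_hour is only an example):
-- Same pattern as above, with a 1-hour bucket instead of 15 minutes
CREATE MATERIALIZED VIEW latest_forecast_1_hour
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', dt_start_utc) AS dt_start,
    fc_id,
    fc_power_supplier_id,
    fc_power_type_id,
    last(fc_kwh, version_utc) AS fc_kwh_last
FROM fc_power_raw_import_normalized
GROUP BY dt_start, fc_id, fc_power_supplier_id, fc_power_type_id;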
Upvotes: 1