Cord Kaldemeyer

Reputation: 6897

Efficient way to get latest/newest version of versioned time series per group/aggregate function using TimescaleDB

We are moving to TimescaleDB and have migrated some large tables with more than 400 million rows holding versioned time series forecasts.

The table structure is as follows, where dt_start_utc holds the actual date of the forecast and version_utc the issue date of the forecast (the newer the version, the closer it was issued to the actual forecast date):

sandbox_cord=# \d+ fc_power_raw_import_normalized
                                    Table "public.fc_power_raw_import_normalized"
        Column        |            Type             | Collation | Nullable | Default | Storage | Stats target | Description
----------------------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
dt_start_utc         | timestamp without time zone |           | not null |         | plain   |              |
fc_id                | integer                     |           | not null |         | plain   |              |
fc_kwh               | integer                     |           |          |         | plain   |              |
fc_power_supplier_id | integer                     |           | not null |         | plain   |              |
fc_power_type_id     | integer                     |           | not null |         | plain   |              |
version_utc          | timestamp without time zone |           | not null |         | plain   |              |
Indexes:
    "fc_power_raw_import_normalized2_pkey" PRIMARY KEY, btree (dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id, version_utc)
    "fc_power_raw_import_normalize2_dt_start_utc_fc_id_fc_power_s_id" btree (dt_start_utc DESC, fc_id, fc_power_supplier_id, fc_power_type_id, version_utc DESC)
    "fc_power_raw_import_normalize2_dt_start_utc_fc_power_supplie_id" btree (dt_start_utc DESC, fc_power_supplier_id, fc_power_type_id)
    "fc_power_raw_import_normalized2_dt_start_utc_idx" btree (dt_start_utc DESC)
    "fc_power_raw_import_normalized2_version_utc_idx" btree (version_utc DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON fc_power_raw_import_normalized2 FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_3_2334_chunk,
            _timescaledb_internal._hyper_3_2335_chunk,
            _timescaledb_internal._hyper_3_2336_chunk,
            _timescaledb_internal._hyper_3_2337_chunk,
            _timescaledb_internal._hyper_3_2338_chunk,
            _timescaledb_internal._hyper_3_2339_chunk
Access method: heap
...

And here are some values:

sandbox_cord=# SELECT * FROM fc_power_raw_import_normalized ORDER BY fc_id ASC LIMIT 25;
    dt_start_utc     | fc_id | fc_kwh | fc_power_supplier_id | fc_power_type_id |     version_utc
---------------------+-------+--------+----------------------+------------------+---------------------
2020-08-27 00:00:00 |     9 |    167 |                    5 |                1 | 2020-08-23 00:27:03
2020-08-27 00:00:00 |     9 |    150 |                    5 |                1 | 2020-08-23 01:12:37
2020-08-27 00:00:00 |     9 |    132 |                    5 |                1 | 2020-08-23 07:11:42
2020-08-27 00:00:00 |     9 |    144 |                    5 |                1 | 2020-08-23 13:12:11
2020-08-27 00:00:00 |     9 |    161 |                    5 |                1 | 2020-08-23 19:13:05
2020-08-27 00:00:00 |     9 |    166 |                    5 |                1 | 2020-08-24 01:11:53
...

Now I want to get the newest version of the time series per group (fc_id), which I get with the following query:

SELECT *
FROM fc_power_raw_import_normalized
WHERE  (dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id, version_utc) IN (
    SELECT
    dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id, MAX(version_utc) version_utc
    FROM
    fc_power_raw_import_normalized
    WHERE
    dt_start_utc > now() - INTERVAL '2 weeks'
    GROUP BY
    dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id
)
AND dt_start_utc > now() - INTERVAL '2 weeks'
ORDER by fc_id, dt_start_utc, version_utc;
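For reference, the same "latest version per group" result can usually be expressed more compactly with PostgreSQL's DISTINCT ON (a sketch, assuming the same table and columns as above):

```sql
-- Sketch: newest version per (dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id);
-- DISTINCT ON keeps the first row of each group, so the ORDER BY must start
-- with the DISTINCT ON columns and sort version_utc descending.
SELECT DISTINCT ON (dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id)
       dt_start_utc, fc_id, fc_kwh, fc_power_supplier_id, fc_power_type_id, version_utc
FROM fc_power_raw_import_normalized
WHERE dt_start_utc > now() - INTERVAL '2 weeks'
ORDER BY dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id, version_utc DESC;
```

Whether this is faster than the IN-subquery version depends on the available indexes and the TimescaleDB version, so it is only an alternative formulation, not a guaranteed speedup.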

UPDATE: Alternatively, the same result can be obtained with the following query using TimescaleDB's last() function:

SELECT
dt_start_utc,
    fc_id,
    fc_power_supplier_id,
    fc_power_type_id,
    last(fc_kwh, version_utc) AS fc_kwh_last
FROM fc_power_raw_import_normalized
WHERE dt_start_utc > now () - INTERVAL '2 weeks'
GROUP BY dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id
ORDER BY dt_start_utc ASC, fc_id ASC;

This yields:

   dt_start_utc     | fc_id | fc_kwh | fc_power_supplier_id | fc_power_type_id |     version_utc
---------------------+-------+--------+----------------------+------------------+---------------------
2021-10-12 16:45:00 |    19 |     99 |                    4 |                1 | 2021-10-12 13:13:50
2021-10-12 16:45:00 |    19 |     99 |                    4 |                2 | 2021-10-12 13:14:47
2021-10-12 17:00:00 |    19 |    100 |                    4 |                1 | 2021-10-12 13:13:50
2021-10-12 17:00:00 |    19 |    100 |                    4 |                2 | 2021-10-12 13:14:47
2021-10-12 17:15:00 |    19 |    103 |                    4 |                1 | 2021-10-12 13:13:50
2021-10-12 17:15:00 |    19 |    103 |                    4 |                2 | 2021-10-12 13:14:47
2021-10-12 17:30:00 |    19 |    105 |                    4 |                1 | 2021-10-12 13:13:50
2021-10-12 17:30:00 |    19 |    105 |                    4 |                2 | 2021-10-12 13:14:47
2021-10-12 17:45:00 |    19 |    108 |                    4 |                1 | 2021-10-12 13:13:50
2021-10-12 17:45:00 |    19 |    108 |                    4 |                2 | 2021-10-12 13:14:47
2021-10-12 18:00:00 |    19 |    108 |                    4 |                1 | 2021-10-12 13:13:50
2021-10-12 18:00:00 |    19 |    108 |                    4 |                2 | 2021-10-12 13:14:47
2021-10-12 18:15:00 |    19 |    105 |                    4 |                1 | 2021-10-12 13:13:50
2021-10-12 18:15:00 |    19 |    105 |                    4 |                2 | 2021-10-12 13:14:47
2021-10-12 18:30:00 |    19 |     82 |                    4 |                1 | 2021-10-12 18:28:47
2021-10-12 18:30:00 |    19 |     82 |                    4 |                2 | 2021-10-12 18:29:59
2021-10-12 18:45:00 |    19 |     82 |                    4 |                1 | 2021-10-12 18:28:47
2021-10-12 18:45:00 |    19 |     82 |                    4 |                2 | 2021-10-12 18:29:59
2021-10-12 19:00:00 |    19 |     81 |                    4 |                1 | 2021-10-12 18:28:47
...

But this is actually pretty slow: e.g., it takes ~77 seconds to calculate this for 3 months, and considerably more for longer horizons.

I have tried different variants of this query using INNER JOIN or window functions and also added different indexes according to this article. But none of these approaches has brought a significant increase in performance.
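For completeness, the window-function variant I mention can be sketched like this (ranking rows per group with ROW_NUMBER(), newest version first):

```sql
-- Sketch: rank versions within each group and keep only the newest one
SELECT dt_start_utc, fc_id, fc_kwh, fc_power_supplier_id, fc_power_type_id, version_utc
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY dt_start_utc, fc_id, fc_power_supplier_id, fc_power_type_id
               ORDER BY version_utc DESC
           ) AS rn
    FROM fc_power_raw_import_normalized
    WHERE dt_start_utc > now() - INTERVAL '2 weeks'
) ranked
WHERE rn = 1
ORDER BY fc_id, dt_start_utc;
```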

Another approach was to work with continuous aggregates, which require a time bucket to be defined (see this question: Continuous aggregates in postgres/timescaledb requires time_bucket-function?) and therefore did not seem suitable here. I also changed the underlying chunk size, but again without significant improvement.

So far, for this table structure and query, the performance of TimescaleDB is not much better than that of our optimized "conventional" relational database (MariaDB), and I expected a dedicated time series database to deliver better default performance.

Now my question is simple: How can I efficiently get the latest/newest version of versioned time series per group/aggregate function using TimescaleDB and this table structure?

Any hints are welcome!

Upvotes: 2

Views: 611

Answers (1)

Cord Kaldemeyer

Reputation: 6897

I have now found a solution using Continuous Aggregates (with Real Time Aggregation as the default behaviour). It might have been obvious, but it nonetheless took me a while to discover, so I will share it here to help others with the same problem.

The query is slightly adapted, but the example shows the general approach.

Example:

  1. Create a continuous aggregate:

    CREATE MATERIALIZED VIEW latest_forecast_15_min
    WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('15 minutes', dt_start_utc) AS dt_start,
        fc_id,
        fc_power_supplier_id,
        fc_power_type_id,
        last(fc_kwh, version_utc) AS fc_kwh_last
    FROM fc_power_raw_import_normalized
    -- note: non-immutable functions such as now() are not allowed in a
    -- continuous aggregate definition; the refresh policy bounds the window
    GROUP BY dt_start, fc_id, fc_power_supplier_id, fc_power_type_id;
    
  2. Add a refresh policy for the last 2 weeks that updates hourly:

    SELECT add_continuous_aggregate_policy('latest_forecast_15_min',
        start_offset => INTERVAL '2 WEEKS',
        end_offset => INTERVAL '1 HOUR',
        schedule_interval => INTERVAL '1 h');
    
  3. Select data from the Continuous Aggregate:

    SELECT
    dt_start,
    fc_id,
    fc_power_supplier_id,
    fc_power_type_id,
    fc_kwh_last
    FROM latest_forecast_15_min;
    

For my example, this reduced the query time from ~13 seconds to ~2 seconds and offers the flexibility of adding further aggregates for different query types, such as other groupings or time intervals.
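For completeness: besides the refresh policy, a continuous aggregate can also be refreshed manually for a given window, which is useful for backfilling historical data that the policy's start_offset no longer covers (a sketch using TimescaleDB's refresh_continuous_aggregate procedure; the date range is only an example):

```sql
-- Backfill the aggregate for a fixed historical window
CALL refresh_continuous_aggregate('latest_forecast_15_min',
     '2021-01-01', '2021-10-01');
```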

Upvotes: 1

Related Questions