how to setup continuous aggregation in timescaledb for historical data

Question

I'd like to have materialized data automatically calculated as new (however historical) data comes.

My data are historical time series manually added by user (application is used but data import is triggered manually). The data is stored in one table promoted to timescale. Every time user adds about 36000 rows with 100ms sampling rate (about one hour of data). The first sample is never tied to the beginning of an hour and, also, the following portion of data is not mandatory tied to the previous portion (there might be gaps).

What I want, is to have materialized data in buckets of 10 minutes, 1 hour, 1 day and the process of data materialization should run automatically.

My first confusion is should I create continuous_aggregate_policy after I created continuous aggregate. Currently I don't do that. When I inserted the first portion of data (with simple INSERT), I could see that the data materialized successfully (I checked this by setting timescaledb.materialized_only = true). Then, when I've added a lot more data, it wasn't materialized. I found the scheduler job (there is just one job) and tried to reschedule it like SELECT alter_job(1, next_start => now()); but nothing happened, data are still not materialized (I see last_run_status is Success and last_successful_finish changed to the moment when I ran a query). However, hypertable_schema and hypertable_name fields are NULL, so I wonder if this background job is going to do anything.

Adding continuous_aggregate_policy looks not logical to me if I add historical data (like 1-hour-long data but 1 year old). As I undertand, continuous_aggregate_policy updates only recent data.

Should my option be to materialize the data manually right after I add it by calling CALL refresh_continuous_aggregate('conditions', '2020-01-01', '2020-02-01'); ? Is data not going to materialize if I don't do this manually upon inserting the data? In the case of manual refreshing I afraid of possible data inconsistency if my application will fail for some reason between commiting the new data and calling continuous aggregate refresh query. In this case, can I just do both operations in one query? What happen if this query fails then? How do I know if the data was committed? Many uprising questions...

Next step in my project will be to automate data ingest, so the interaction with the user and supervision of that process are excluded.

So, what is the right strategy to automate creation of materialized data for manually added historical data?

P.S. re-creating materialized viewes after inserting of new data is not an option because it'll take days to complete.

k_rus · Accepted Answer

I assume that TimescaleDB version is 2.x. Before 2.0 creating a continuous aggregate view was always creating the corresponding policy. This change and motivation is described in Changes in TimescaleDB 2.0.

It is necessary to create continuous aggregate policies for each continuous aggregate otherwise no data will be materialized into continuous aggregates automatically. I am not sure how you were able to get data into the continuous aggregate initially.

What is the job 1 about? You can check existing jobs in timescaledb_information.jobs. TimescaleDB usually creates a job for telemetry, so it might be it.

When you create a policy to refresh continuous aggregate, make sure to choose a refresh window, which will cover the newly inserted data at the moment the policy will run. See start_offset and end_offset parameters. Note that only fully covered buckets will be refreshed. Covering old data should not be an issue, since TimescaleDB keeps track of invalidated buckets due to newly inserted or updated data, and doesn't refresh buckets where no changes happen into.

how to setup continuous aggregation in timescaledb for historical data

Answers (1)

Related Questions