HansHupe

Reputation: 496

Time series database with time-weighted-average aggregation function for irregular time series?

Our sensors produce values in irregular time intervals:

12:00 10
12:02 20
12:22 30
12:29 40

I am trying to find a time series database that can automatically calculate the average value over certain regular time intervals (e.g. 10 min). Of course, the longer a value is valid within the interval, the more weight it gets in the average (a time-weighted average), e.g. for 12:00-12:10: (10*2 + 20*8)/10 = 18.
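
Assuming each value holds until the next sample arrives (as in the example above), the remaining 10-minute intervals would be:

12:10-12:20: 20 (the value 20 holds for the whole interval)
12:20-12:30: (20*2 + 30*7 + 40*1)/10 = 29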

I have been searching the internet for hours and found lots of time series databases that mention irregular time series (e.g. InfluxDB, OpenTSDB, etc.), and most of them have some SQL-like query language with aggregation functions.

Unfortunately, they don't say exactly how irregularly spaced values are averaged. As I don't want to try all of them, can someone tell me which databases support calculating a time-weighted average? Thanks!

Upvotes: 4

Views: 4142

Answers (4)

valyala

Reputation: 18084

The time-weighted average (TWA) can be calculated if the TSDB supports a function that integrates values over a given time range: the TWA is then the integral over the given duration divided by the duration. For example, the following query calculates the time-weighted average of the metric power over the last hour in VictoriaMetrics:

integrate(power[1h])/1h
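
The same pattern applied to the question's 10-minute windows would be (a sketch following the syntax above; power is a placeholder metric name):

integrate(power[10m])/10m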

See more details about the integrate() function at MetricsQL docs.

Upvotes: 0

Sergei Rodionov

Reputation: 4559

A weighted time average aggregator (wtavg) is supported in the Axibase Time Series Database: http://axibase.com/products/axibase-time-series-database/visualization/widgets/configuring-the-widgets/aggregators/

wtavg weighs older samples at a linearly decreasing rate relative to the current time.

This aggregator is supported in REST API, SQL layer, and the rule engine.
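
As a rough sketch of what this could look like in the SQL layer (assuming wtavg is exposed as an aggregate over PERIOD(); "sensor.metric" is a placeholder name, and the exact syntax may differ, so check the ATSD documentation):

SELECT datetime, wtavg(value)
  FROM "sensor.metric"
GROUP BY PERIOD(10 MINUTE)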

EDIT 2016-06-15T12:52:00Z: Supported interpolation functions:

  1. LINEAR
  2. PREVIOUS
  3. NEXT
  4. VALUE(v)
  5. NONE

Disclosure: I work for Axibase.

Upvotes: 1

Sharon Rosner

Reputation: 21

I recently had to provide a solution for weighted average over irregular samples for our own SCADA/IoT product, with data stored in PostgreSQL. If you'd like to roll your own, here's how you can do it.

Let's assume the following table:

create table samples (
  stamp  timestamptz,
  series integer,
  value  float
);

insert into samples values
  ('2018-04-30 23:00:00+02', 1, 12.3),
  ('2018-05-01 01:45:00+02', 1, 22.2),
  ('2018-05-01 02:13:00+02', 1, 21.6),
  ('2018-05-01 02:26:00+02', 1, 14.9),
  ('2018-05-01 03:02:00+02', 1, 16.9);

To calculate a weighted average over regular periods, we'll need to do the following:

  • "Partition" the irregular samples into regular periods
  • Determine how long each sample was held (its duration)
  • Calculate a weight for each sample (its duration divided by the period)
  • Sum up value times weight for each period

Before presenting the code, we'll make the following assumptions:

  • The weighted average is calculated for a given time range, with a given period.
  • We don't need to deal with null values, which would make the solution slightly more complicated (namely when it comes to calculating weights).
  • The code is written for PostgreSQL using two techniques: common table expressions and window functions. If you use another DB, you might need to write it differently.

1. Converting irregular samples into regular periods

Suppose we want to calculate the hourly weighted average for series 1 over the time range between 2018-05-01 00:00:00+02 and 2018-05-01 04:00:00+02. We'll start by querying the given time range, adding an aligned timestamp:

select
  stamp,
  to_timestamp(extract(epoch from stamp)::integer / 3600 * 3600)
    as stamp_aligned,
  value
from samples
where
  series = 1 and
  stamp >= '2018-05-01 00:00:00+02' and
  stamp <= '2018-05-01 04:00:00+02';

Which gives us:

         stamp          |     stamp_aligned      | value 
------------------------+------------------------+-------
 2018-05-01 01:45:00+02 | 2018-05-01 01:00:00+02 |  22.2
 2018-05-01 02:13:00+02 | 2018-05-01 02:00:00+02 |  21.6
 2018-05-01 02:26:00+02 | 2018-05-01 02:00:00+02 |  14.9
 2018-05-01 03:02:00+02 | 2018-05-01 03:00:00+02 |  16.9
(4 rows)

Note that:

  • From the result we can't tell the value for 00:00:00, nor the value for 01:00:00.
  • The stamp_aligned column tells us to which time period the record belongs, but the table is missing the value at the beginning of each period.

To solve these problems, we'll query for the last known value before the given time range, and add records for round hours, which we'll later fill with correct values:

with
t_values as (
  select * from (
    -- select last value prior to time range
    (select
      stamp,
      to_timestamp(extract(epoch from stamp)::integer / 3600 * 3600)
        as stamp_aligned,
      value,
      false as filled_in
    from samples
    where
      series = 1 and
      stamp <  '2018-05-01 00:00:00+02'
    order by
      stamp desc
    limit 1) union

    -- select records from given time range
    (select 
      stamp,
      to_timestamp(extract(epoch from stamp)::integer / 3600 * 3600)
        as stamp_aligned,
      value,
      false as filled_in
    from samples
    where
      series = 1 and
      stamp >= '2018-05-01 00:00:00+02' and
      stamp <= '2018-05-01 04:00:00+02'
    order by
      stamp) union

    -- select all regular periods for given time range
    (select
      stamp,
      stamp as stamp_aligned,
      null as value,
      true as filled_in
    from generate_series(
      '2018-05-01 00:00:00+02',
      '2018-05-01 04:00:00+02',
      interval '3600 seconds'
    ) stamp)
  ) states
  order by stamp
)
select * from t_values;

Which gives us:

         stamp          |     stamp_aligned      | value | filled_in 
------------------------+------------------------+-------+-----------
 2018-04-30 23:00:00+02 | 2018-04-30 23:00:00+02 |  12.3 | f
 2018-05-01 00:00:00+02 | 2018-05-01 00:00:00+02 |     ¤ | t
 2018-05-01 01:00:00+02 | 2018-05-01 01:00:00+02 |     ¤ | t
 2018-05-01 01:45:00+02 | 2018-05-01 01:00:00+02 |  22.2 | f
 2018-05-01 02:00:00+02 | 2018-05-01 02:00:00+02 |     ¤ | t
 2018-05-01 02:13:00+02 | 2018-05-01 02:00:00+02 |  21.6 | f
 2018-05-01 02:26:00+02 | 2018-05-01 02:00:00+02 |  14.9 | f
 2018-05-01 03:00:00+02 | 2018-05-01 03:00:00+02 |     ¤ | t
 2018-05-01 03:02:00+02 | 2018-05-01 03:00:00+02 |  16.9 | f
 2018-05-01 04:00:00+02 | 2018-05-01 04:00:00+02 |     ¤ | t
(10 rows)

So we have at least one record for each time period, but we still need to fill in values for the filled-in records:

with
t_values as (
  ...
),
-- since records generated using generate_series do not contain values,
-- we need to copy the value from the last non-generated record.
t_with_filled_in_values as (
  -- the outer query serves to remove any record prior to the given 
  -- time range
  select *
  from (
    select 
      stamp,
      stamp_aligned,
      -- fill in value from last non-filled record (the first record 
      -- having the same filled_in_partition value)
      (case when filled_in then
        first_value(value) over (partition by filled_in_partition
        order by stamp) else value end) as value
    from (
      select
        stamp, 
        stamp_aligned, 
        value,
        filled_in,
        -- this field is incremented on every non-filled record
        sum(case when filled_in then 0 else 1 end) 
          over (order by stamp) as filled_in_partition
      from 
        t_values
    ) t_filled_in_partition
  ) t_filled_in_values
  -- we wrap the filling-in query in order to remove any record before the
  -- beginning of the given time range
  where stamp >= '2018-05-01 00:00:00+02'
  order by stamp
)
select * from t_with_filled_in_values;

Which gives us the following:

         stamp          |     stamp_aligned      | value 
------------------------+------------------------+-------
 2018-05-01 00:00:00+02 | 2018-05-01 00:00:00+02 |  12.3
 2018-05-01 01:00:00+02 | 2018-05-01 01:00:00+02 |  12.3
 2018-05-01 01:45:00+02 | 2018-05-01 01:00:00+02 |  22.2
 2018-05-01 02:00:00+02 | 2018-05-01 02:00:00+02 |  22.2
 2018-05-01 02:13:00+02 | 2018-05-01 02:00:00+02 |  21.6
 2018-05-01 02:26:00+02 | 2018-05-01 02:00:00+02 |  14.9
 2018-05-01 03:00:00+02 | 2018-05-01 03:00:00+02 |  14.9
 2018-05-01 03:02:00+02 | 2018-05-01 03:00:00+02 |  16.9
 2018-05-01 04:00:00+02 | 2018-05-01 04:00:00+02 |  16.9
(9 rows)

So we're all good - we've added records with correct values for all round hours, and we also removed the first record, which gave us the value for the beginning of the time range but lay outside of it. Now we are ready for the next step.

2. Calculating weighted average

We'll continue by calculating the duration for each record:

with
t_values as (
  ...
),
t_with_filled_in_values as (
  ...
),
t_with_weight as (
  select
    stamp,
    stamp_aligned,
    value,
    -- use window to get stamp from next record in order to calculate 
    -- the duration of the record which, divided by the period, gives 
    -- us the weight.
    coalesce(extract(epoch from (lead(stamp)
      over (order by stamp) - stamp)), 3600)::float / 3600 as weight
  from t_with_filled_in_values
  order by stamp
)
select * from t_with_weight;

Which gives us:

         stamp          |     stamp_aligned      | value |       weight       
------------------------+------------------------+-------+--------------------
 2018-05-01 00:00:00+02 | 2018-05-01 00:00:00+02 |  12.3 |                  1
 2018-05-01 01:00:00+02 | 2018-05-01 01:00:00+02 |  12.3 |               0.75
 2018-05-01 01:45:00+02 | 2018-05-01 01:00:00+02 |  22.2 |               0.25
 2018-05-01 02:00:00+02 | 2018-05-01 02:00:00+02 |  22.2 |  0.216666666666667
 2018-05-01 02:13:00+02 | 2018-05-01 02:00:00+02 |  21.6 |  0.216666666666667
 2018-05-01 02:26:00+02 | 2018-05-01 02:00:00+02 |  14.9 |  0.566666666666667
 2018-05-01 03:00:00+02 | 2018-05-01 03:00:00+02 |  14.9 | 0.0333333333333333
 2018-05-01 03:02:00+02 | 2018-05-01 03:00:00+02 |  16.9 |  0.966666666666667
 2018-05-01 04:00:00+02 | 2018-05-01 04:00:00+02 |  16.9 |                  1
(9 rows)

All that's left is to sum it up:

with
t_values as (
  ...
),
t_with_filled_in_values as (
  ...
),
t_with_weight as (
  ...
)
select
  stamp_aligned as stamp,
  sum(value * weight) as avg
from t_with_weight
group by stamp_aligned
order by stamp_aligned;

The result:

         stamp          |       avg        
------------------------+------------------
 2018-05-01 00:00:00+02 |             12.3
 2018-05-01 01:00:00+02 |           14.775
 2018-05-01 02:00:00+02 | 17.9333333333333
 2018-05-01 03:00:00+02 | 16.8333333333333
 2018-05-01 04:00:00+02 |             16.9
(5 rows)

You can find the complete code in this gist.
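
For reference, here is the full query assembled from the CTEs above (the same code, combined into a single statement):

with
t_values as (
  select * from (
    -- select last value prior to time range
    (select
      stamp,
      to_timestamp(extract(epoch from stamp)::integer / 3600 * 3600)
        as stamp_aligned,
      value,
      false as filled_in
    from samples
    where
      series = 1 and
      stamp < '2018-05-01 00:00:00+02'
    order by stamp desc
    limit 1) union

    -- select records from given time range
    (select
      stamp,
      to_timestamp(extract(epoch from stamp)::integer / 3600 * 3600)
        as stamp_aligned,
      value,
      false as filled_in
    from samples
    where
      series = 1 and
      stamp >= '2018-05-01 00:00:00+02' and
      stamp <= '2018-05-01 04:00:00+02'
    order by stamp) union

    -- select all regular periods for given time range
    (select
      stamp,
      stamp as stamp_aligned,
      null as value,
      true as filled_in
    from generate_series(
      '2018-05-01 00:00:00+02',
      '2018-05-01 04:00:00+02',
      interval '3600 seconds'
    ) stamp)
  ) states
  order by stamp
),
t_with_filled_in_values as (
  select *
  from (
    select
      stamp,
      stamp_aligned,
      -- fill in value from the last non-filled record
      (case when filled_in then
        first_value(value) over (partition by filled_in_partition
        order by stamp) else value end) as value
    from (
      select
        stamp,
        stamp_aligned,
        value,
        filled_in,
        -- this field is incremented on every non-filled record
        sum(case when filled_in then 0 else 1 end)
          over (order by stamp) as filled_in_partition
      from t_values
    ) t_filled_in_partition
  ) t_filled_in_values
  -- remove any record before the beginning of the given time range
  where stamp >= '2018-05-01 00:00:00+02'
  order by stamp
),
t_with_weight as (
  select
    stamp,
    stamp_aligned,
    value,
    -- duration until the next record, divided by the period
    coalesce(extract(epoch from (lead(stamp)
      over (order by stamp) - stamp)), 3600)::float / 3600 as weight
  from t_with_filled_in_values
  order by stamp
)
select
  stamp_aligned as stamp,
  sum(value * weight) as avg
from t_with_weight
group by stamp_aligned
order by stamp_aligned;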

Upvotes: 2

welch

Reputation: 964

OpenTSDB performs aggregation across all series in a query at the time(s) implied by the query. For any series that has no data value at a timestamp, it linearly interpolates a value from the values before and after. It does this "upsampling" at query time -- the original data is always stored as it arrived. You can perform a trailing windowed time average, but not an exponentially weighted moving average (I believe that is what you meant by time-weighted?).

http://opentsdb.net/docs/build/html/user_guide/query/aggregators.html
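
For example, a query that downsamples a metric to 10-minute averages could look like this (a sketch; sensor.value is a placeholder metric name, and note that the avg downsampler computes a plain average of the samples in each window, not a time-weighted one):

start=1h-ago&m=avg:10m-avg:sensor.value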

(I should add that this is not a blanket recommendation for OpenTSDB as the db you should use; I'm just responding to your question.)

Upvotes: 2
