TheStranger

Reputation: 1587

How do I use AVG() with GROUP BY in a time_bucket_gapfill() query in TimescaleDB (PostgreSQL)?

I'm using TimescaleDB in my PostgreSQL database and I have the following two tables:

windows_log

| windows_log_id |      timestamp      | computer_id | log_count |
------------------------------------------------------------------
|        1       | 2021-01-01 00:01:02 |     382     |     30    |
|        2       | 2021-01-02 14:59:55 |     382     |     20    |
|        3       | 2021-01-02 19:08:24 |     382     |     20    |
|        4       | 2021-01-03 13:05:36 |     382     |     10    |
|        5       | 2021-01-03 22:21:14 |     382     |     40    |

windows_reliability_score

| computer_id (FK) |      timestamp      | reliability_score |
--------------------------------------------------------------
|        382       | 2021-01-01 22:21:14 |          6        |
|        382       | 2021-01-01 22:21:14 |          6        |
|        382       | 2021-01-01 22:21:14 |          6        |
|        382       | 2021-01-02 22:21:14 |          1        |
|        382       | 2021-01-02 22:21:14 |          3        |
|        382       | 2021-01-03 22:21:14 |          7        |
|        382       | 2021-01-03 22:21:14 |          8        |
|        382       | 2021-01-03 22:21:14 |          9        |

Note: both tables are hypertables, indexed on the timestamp column.
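
Roughly, the tables are set up like this (a simplified sketch; the exact column types and create_hypertable options may differ):

CREATE TABLE windows_log (
    windows_log_id BIGINT,
    timestamp      TIMESTAMPTZ NOT NULL,
    computer_id    INTEGER,
    log_count      INTEGER
);
SELECT create_hypertable('windows_log', 'timestamp');

CREATE TABLE windows_reliability_score (
    computer_id       INTEGER,  -- FK
    timestamp         TIMESTAMPTZ NOT NULL,
    reliability_score INTEGER
);
SELECT create_hypertable('windows_reliability_score', 'timestamp');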

So I'm trying to get the average reliability_score for each time bucket, but it just gives me the average for everything, instead of the average per specific bucket...

This is my query:

SELECT time_bucket_gapfill(CAST(1 * INTERVAL '1 day' AS INTERVAL), wl.timestamp) AS timestamp, 
COALESCE(SUM(log_count), 0) AS log_count,
AVG(reliability_score) AS reliability_score
FROM windows_log wl
JOIN windows_reliability_score USING (computer_id)
WHERE wl.timestamp >= '2021-01-01 00:00:00.0' AND wl.timestamp < '2021-01-04 00:00:00.0'
GROUP BY timestamp
ORDER BY timestamp asc

This is the result I'm looking for:

|      timestamp      | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 |     30    |          6        |
| 2021-01-02 00:00:00 |     20    |          2        |
| 2021-01-03 00:00:00 |     20    |          8        |

But this is what I get:

|      timestamp      | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 |     30    |        5.75       |
| 2021-01-02 00:00:00 |     20    |        5.75       |
| 2021-01-03 00:00:00 |     20    |        5.75       |

Upvotes: 0

Views: 1048

Answers (2)

Ryan

Reputation: 156

Given what we can glean from your example, there's no simple way to join these two tables with the given functions and get the results you want. The schema, as presented, just makes that difficult.

If this is really what your data/schema look like, then one solution is to use multiple CTEs to get the two values from each distinct table and then join based on bucket and computer.

WITH wrs AS (
    SELECT time_bucket_gapfill('1 day', timestamp) AS bucket, 
    computer_id,
    AVG(reliability_score) AS reliability_score  
    FROM windows_reliability_score
    WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
    GROUP BY 1, 2
),
wl AS (
    SELECT time_bucket_gapfill('1 day', wl.timestamp) bucket, wl.computer_id,
    sum(log_count) total_logs
    FROM windows_log wl
    WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
    GROUP BY 1, 2
)
SELECT wrs.bucket, wrs.computer_id, reliability_score, total_logs
FROM wrs LEFT JOIN wl ON wrs.bucket = wl.bucket AND wrs.computer_id = wl.computer_id;

The filtering has to be applied inside each CTE because a date filter on the outer query likely wouldn't be pushed down, so you would end up scanning the entire hypertable before the filter is applied (not what you want, I assume).
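
For the same reason, if you only care about a single machine, I would push the computer_id filter into each CTE as well rather than filtering after the join, something along these lines (an untested sketch, reusing computer id 382 from your sample data):

WITH wrs AS (
    SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
    AVG(reliability_score) AS reliability_score
    FROM windows_reliability_score
    WHERE computer_id = 382
    AND timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
    GROUP BY 1
),
wl AS (
    SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
    sum(log_count) total_logs
    FROM windows_log
    WHERE computer_id = 382
    AND timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
    GROUP BY 1
)
SELECT wrs.bucket, reliability_score, total_logs
FROM wrs LEFT JOIN wl ON wrs.bucket = wl.bucket;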

I tried to quickly re-create your sample schema, so I apologize if I got names wrong somewhere.

Upvotes: 1

k_rus

Reputation: 3219

The main issue is that the join condition is on column computer_id, and both tables have the same value 382 in that column. Thus each row from table windows_log is joined with each row from table windows_reliability_score (a Cartesian product of all rows). Also, the grouping is done on the column timestamp, which is ambiguous and is resolved to timestamp from windows_log. As a result, the average uses all values of reliability_score for every value of timestamp from windows_log, which explains the undesired result.
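
To see this on the sample data (5 rows in windows_log, 8 rows in windows_reliability_score), the join produces 5 * 8 = 40 rows, and AVG(reliability_score) always sees the same 8 values: (6 + 6 + 6 + 1 + 3 + 7 + 8 + 9) / 8 = 5.75. A quick check, assuming the table names above:

SELECT COUNT(*) AS joined_rows  -- 5 log rows x 8 score rows = 40
FROM windows_log wl
JOIN windows_reliability_score rs USING (computer_id);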

How the grouping ambiguity is resolved, namely in favor of the input column, i.e., the table column, is explained in the GROUP BY description of the SELECT documentation:

In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name.

To avoid groups that include all rows matching on computer id, windows_log_id can be used for grouping. This also makes it possible to bring log_count into the query result. And if it is desired to keep the output name timestamp, GROUP BY should refer to the column by position. For example:

SELECT time_bucket_gapfill('1 day'::INTERVAL, rs.timestamp) AS timestamp, 
AVG(reliability_score) AS reliability_score,
log_count
FROM windows_log wl
JOIN windows_reliability_score rs USING (computer_id)
WHERE rs.timestamp >= '2021-01-01 00:00:00.0' AND rs.timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, windows_log_id, log_count
ORDER BY timestamp asc

For ORDER BY this is not an issue, since the output name is used there. From the same doc:

If an ORDER BY expression is a simple name that matches both an output column name and an input column name, ORDER BY will interpret it as the output column name.

Upvotes: 0
