Reputation: 1
I have a scenario where I need to calculate a running total using the SUM window function in SQL. The issue arises because some rows have duplicate timestamps, and the RANGE clause in the window function groups all rows with the same timestamp together, causing incorrect calculations.
Here’s an example of the SQL I’m trying to use:
SUM(volume) OVER (
PARTITION BY ID
ORDER BY td.timestamp
RANGE BETWEEN INTERVAL '60' SECOND PRECEDING AND CURRENT ROW
) AS total_volume
Problem:
Constraints:
Is there a way to adjust the SQL to process rows correctly within the same timestamp range while adhering to the time window logic?
Input
timestamp | Volume |
---|---|
2024-11-16 08:00:00 | 10 |
2024-11-16 08:00:00 | 20 |
2024-11-16 08:01:00 | 30 |
2024-11-16 08:02:00 | 40 |
2024-11-16 08:02:00 | 50 |
Current Result (Using RANGE and Grouping by Timestamp)
timestamp | RollVolume |
---|---|
2024-11-16 08:00:00 | 30 |
2024-11-16 08:00:00 | 30 |
2024-11-16 08:01:00 | 30 |
2024-11-16 08:02:00 | 90 |
2024-11-16 08:02:00 | 90 |
Expected Output
timestamp | RollVolume |
---|---|
2024-11-16 08:00:00 | 10 |
2024-11-16 08:00:00 | 30 |
2024-11-16 08:01:00 | 30 |
2024-11-16 08:02:00 | 40 |
2024-11-16 08:02:00 | 90 |
Here, The RollVolume is calculated row by row within each timestamp, instead of grouping rows with identical timestamps.
Upvotes: 0
Views: 101
Reputation: 881
As mentioned in the comments, deterministic row ordering is required for accurate results. The below uses PostgreSQL's ctid
which represents the physical location of each row in a table but can change with table updates.
WITH ordered AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
FROM input
)
WITH ordered AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
FROM input
)
WITH ordered AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
FROM input
)
SELECT
o1.timestamp,
(
SELECT SUM(o2.Volume)
FROM ordered o2
WHERE o2.timestamp > o1.timestamp - INTERVAL '60 seconds'
AND o2.timestamp <= o1.timestamp
AND (
o2.timestamp < o1.timestamp
OR (o2.timestamp = o1.timestamp AND o2.rn <= o1.rn)
)
) AS RollVolume
FROM ordered o1
ORDER BY o1.timestamp, o1.rn;
Upvotes: 0
Reputation: 86775
The simplest option is to run a cumulative sum from start to finish, without the time RANGE, then use a second cumulative sum to deduct the unwanted rows, with the time RANGE...
This ensures you can use an id column to enforce an ordering without running into errors when trying to use RANGE BETWEEN.
It also ensures the you only include rows "> 60s ago" rather than rows ">= 60s ago".
Performance wise, it only scans the data once, avoiding the cost of correlated sub-queries.
CREATE TABLE example (
id BIGINT GENERATED ALWAYS AS IDENTITY,
x INT,
ts TIMESTAMP,
val INT
)
CREATE TABLE
INSERT INTO
example (x, ts, val)
VALUES
(1, '2024-11-16 08:00:00', 10),
(1, '2024-11-16 08:00:00', 20),
(1, '2024-11-16 08:01:00', 30),
(1, '2024-11-16 08:02:00', 40),
(1, '2024-11-16 08:02:00', 50)
INSERT 0 5
SELECT
*,
SUM(val)
OVER (
PARTITION BY x
ORDER BY ts, id
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
-
COALESCE(
SUM(val)
OVER (
PARTITION BY x
ORDER BY ts
RANGE BETWEEN UNBOUNDED PRECEDING AND INTERVAL '60' SECOND PRECEDING
)
,
0
)
AS rolling_total
FROM
example
id | x | ts | val | rolling_total |
---|---|---|---|---|
1 | 1 | 2024-11-16 08:00:00 | 10 | 10 |
2 | 1 | 2024-11-16 08:00:00 | 20 | 30 |
3 | 1 | 2024-11-16 08:01:00 | 30 | 30 |
4 | 1 | 2024-11-16 08:02:00 | 40 | 40 |
5 | 1 | 2024-11-16 08:02:00 | 50 | 90 |
SELECT 5
Upvotes: 1