Saurabh Ghadge
Saurabh Ghadge

Reputation: 1

How can I perform a SUM window function with a time range but handle duplicate timestamps row-wise in SQL

I have a scenario where I need to calculate a running total using the SUM window function in SQL. The issue arises because some rows have duplicate timestamps, and the RANGE clause in the window function groups all rows with the same timestamp together, causing incorrect calculations.

Here’s an example of the SQL I’m trying to use:

SUM(volume) OVER (
    PARTITION BY ID
    ORDER BY td.timestamp
    RANGE BETWEEN INTERVAL '60' SECOND PRECEDING AND CURRENT ROW
) AS total_volume

Problem:

Constraints:

Is there a way to adjust the SQL to process rows correctly within the same timestamp range while adhering to the time window logic?

Input

timestamp Volume
2024-11-16 08:00:00 10
2024-11-16 08:00:00 20
2024-11-16 08:01:00 30
2024-11-16 08:02:00 40
2024-11-16 08:02:00 50

Current Result (Using RANGE and Grouping by Timestamp)

timestamp RollVolume
2024-11-16 08:00:00 30
2024-11-16 08:00:00 30
2024-11-16 08:01:00 30
2024-11-16 08:02:00 90
2024-11-16 08:02:00 90

Expected Output

timestamp RollVolume
2024-11-16 08:00:00 10
2024-11-16 08:00:00 30
2024-11-16 08:01:00 30
2024-11-16 08:02:00 40
2024-11-16 08:02:00 90

Here, The RollVolume is calculated row by row within each timestamp, instead of grouping rows with identical timestamps.

Upvotes: 0

Views: 101

Answers (2)

keithwalsh
keithwalsh

Reputation: 881

As mentioned in the comments, deterministic row ordering is required for accurate results. The below uses PostgreSQL's ctid which represents the physical location of each row in a table but can change with table updates.

https://dbfiddle.uk/xxm_Ujpm

WITH ordered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
    FROM input
)
WITH ordered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
    FROM input
)
WITH ordered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
    FROM input
)
SELECT
    o1.timestamp,
    (
        SELECT SUM(o2.Volume)
        FROM ordered o2
        WHERE o2.timestamp > o1.timestamp - INTERVAL '60 seconds'
          AND o2.timestamp <= o1.timestamp
          AND ( 
              o2.timestamp < o1.timestamp 
              OR (o2.timestamp = o1.timestamp AND o2.rn <= o1.rn) 
          )
    ) AS RollVolume
FROM ordered o1
ORDER BY o1.timestamp, o1.rn;

Upvotes: 0

MatBailie
MatBailie

Reputation: 86775

The simplest option is to run a cumulative sum from start to finish, without the time RANGE, then use a second cumulative sum to deduct the unwanted rows, with the time RANGE...

This ensures you can use an id column to enforce an ordering without running into errors when trying to use RANGE BETWEEN.

It also ensures the you only include rows "> 60s ago" rather than rows ">= 60s ago".

Performance wise, it only scans the data once, avoiding the cost of correlated sub-queries.

CREATE TABLE example (
  id    BIGINT GENERATED ALWAYS AS IDENTITY,
  x     INT, 
  ts    TIMESTAMP,
  val   INT
) 
CREATE TABLE
INSERT INTO
  example (x, ts, val)
VALUES
  (1, '2024-11-16 08:00:00',    10),
  (1, '2024-11-16 08:00:00',    20), 
  (1, '2024-11-16 08:01:00',    30),
  (1, '2024-11-16 08:02:00',    40), 
  (1, '2024-11-16 08:02:00',    50) 
INSERT 0 5
SELECT
  *, 
  SUM(val)
    OVER (
      PARTITION BY x
          ORDER BY ts, id
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
  -
  COALESCE(
    SUM(val)
      OVER (
         PARTITION BY x
             ORDER BY ts
        RANGE BETWEEN UNBOUNDED PRECEDING AND INTERVAL '60' SECOND PRECEDING
    )
    ,
    0
  )
    AS rolling_total 
FROM
  example

id x ts val rolling_total
1 1 2024-11-16 08:00:00 10 10
2 1 2024-11-16 08:00:00 20 30
3 1 2024-11-16 08:01:00 30 30
4 1 2024-11-16 08:02:00 40 40
5 1 2024-11-16 08:02:00 50 90
SELECT 5

fiddle

Upvotes: 1

Related Questions