I have a scenario where I need to calculate a running total using the SUM window function in SQL. The issue arises because some rows have duplicate timestamps, and the RANGE clause in the window function groups all rows with the same timestamp together, causing incorrect calculations. Here’s an example of the SQL I’m trying to use: SUM(volume) OVER ( PARTITION BY ID ORDER BY td.timestamp RANGE BETWEEN INTERVAL '60' SECOND PRECEDING AND CURRENT ROW ) AS total_volume Problem: When there are duplicate timestamps, the RANGE function groups all entries with the same timestamp into the same window, leading to unexpected results. I need to process the rows individually (row-wise) within the same timestamp. Constraints: I can't add any slight noise in timestamp column, as it will change my time window. It is to be calculate in precision. Is there a way to adjust the SQL to process rows correctly within the same timestamp range while adhering to the time window logic? Input timestamp Volume 2024-11-16 08:00:00 10 2024-11-16 08:00:00 20 2024-11-16 08:01:00 30 2024-11-16 08:02:00 40 2024-11-16 08:02:00 50 Current Result (Using RANGE and Grouping by Timestamp) timestamp RollVolume 2024-11-16 08:00:00 30 2024-11-16 08:00:00 30 2024-11-16 08:01:00 30 2024-11-16 08:02:00 90 2024-11-16 08:02:00 90 Expected Output timestamp RollVolume 2024-11-16 08:00:00 10 2024-11-16 08:00:00 30 2024-11-16 08:01:00 30 2024-11-16 08:02:00 40 2024-11-16 08:02:00 90 Here, The RollVolume is calculated row by row within each timestamp, instead of grouping rows with identical timestamps.

sqlrangewindow-functionsrolling-computationpartition-by

Reputation: 1

How can I perform a SUM window function with a time range but handle duplicate timestamps row-wise in SQL

I have a scenario where I need to calculate a running total using the SUM window function in SQL. The issue arises because some rows have duplicate timestamps, and the RANGE clause in the window function groups all rows with the same timestamp together, causing incorrect calculations.

Here’s an example of the SQL I’m trying to use:

SUM(volume) OVER (
    PARTITION BY ID
    ORDER BY td.timestamp
    RANGE BETWEEN INTERVAL '60' SECOND PRECEDING AND CURRENT ROW
) AS total_volume

Problem:

When there are duplicate timestamps, the RANGE function groups all entries with the same timestamp into the same window, leading to unexpected results.
I need to process the rows individually (row-wise) within the same timestamp.

Constraints:

I can't add any slight noise in timestamp column, as it will change my time window. It is to be calculate in precision.

Is there a way to adjust the SQL to process rows correctly within the same timestamp range while adhering to the time window logic?

Input

timestamp	Volume
2024-11-16 08:00:00	10
2024-11-16 08:00:00	20
2024-11-16 08:01:00	30
2024-11-16 08:02:00	40
2024-11-16 08:02:00	50

Current Result (Using RANGE and Grouping by Timestamp)

timestamp	RollVolume
2024-11-16 08:00:00	30
2024-11-16 08:00:00	30
2024-11-16 08:01:00	30
2024-11-16 08:02:00	90
2024-11-16 08:02:00	90

Expected Output

timestamp	RollVolume
2024-11-16 08:00:00	10
2024-11-16 08:00:00	30
2024-11-16 08:01:00	30
2024-11-16 08:02:00	40
2024-11-16 08:02:00	90

Here, The RollVolume is calculated row by row within each timestamp, instead of grouping rows with identical timestamps.

Upvotes: 0

Answers (2)

keithwalsh

Reputation: 881

As mentioned in the comments, deterministic row ordering is required for accurate results. The below uses PostgreSQL's ctid which represents the physical location of each row in a table but can change with table updates.

https://dbfiddle.uk/xxm_Ujpm

WITH ordered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
    FROM input
)
WITH ordered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
    FROM input
)
WITH ordered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
    FROM input
)
SELECT
    o1.timestamp,
    (
        SELECT SUM(o2.Volume)
        FROM ordered o2
        WHERE o2.timestamp > o1.timestamp - INTERVAL '60 seconds'
          AND o2.timestamp <= o1.timestamp
          AND ( 
              o2.timestamp < o1.timestamp 
              OR (o2.timestamp = o1.timestamp AND o2.rn <= o1.rn) 
          )
    ) AS RollVolume
FROM ordered o1
ORDER BY o1.timestamp, o1.rn;

Upvotes: 0

MatBailie

Reputation: 86775

The simplest option is to run a cumulative sum from start to finish, without the time RANGE, then use a second cumulative sum to deduct the unwanted rows, with the time RANGE...

This ensures you can use an id column to enforce an ordering without running into errors when trying to use RANGE BETWEEN.

It also ensures the you only include rows "> 60s ago" rather than rows ">= 60s ago".

Performance wise, it only scans the data once, avoiding the cost of correlated sub-queries.

CREATE TABLE example (
  id    BIGINT GENERATED ALWAYS AS IDENTITY,
  x     INT, 
  ts    TIMESTAMP,
  val   INT
)

CREATE TABLE

INSERT INTO
  example (x, ts, val)
VALUES
  (1, '2024-11-16 08:00:00',    10),
  (1, '2024-11-16 08:00:00',    20), 
  (1, '2024-11-16 08:01:00',    30),
  (1, '2024-11-16 08:02:00',    40), 
  (1, '2024-11-16 08:02:00',    50)

INSERT 0 5

SELECT
  *, 
  SUM(val)
    OVER (
      PARTITION BY x
          ORDER BY ts, id
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
  -
  COALESCE(
    SUM(val)
      OVER (
         PARTITION BY x
             ORDER BY ts
        RANGE BETWEEN UNBOUNDED PRECEDING AND INTERVAL '60' SECOND PRECEDING
    )
    ,
    0
  )
    AS rolling_total 
FROM
  example

id	x	ts	val	rolling_total
1	1	2024-11-16 08:00:00	10	10
2	1	2024-11-16 08:00:00	20	30
3	1	2024-11-16 08:01:00	30	30
4	1	2024-11-16 08:02:00	40	40
5	1	2024-11-16 08:02:00	50	90

SELECT 5

fiddle

Upvotes: 1

How can I perform a SUM window function with a time range but handle duplicate timestamps row-wise in SQL

Answers (2)

Related Questions