Paul
Paul

Reputation: 594

SQL VERTICA group by intervals

How to group records by interval 4 hours?

For example, I have rows that contain data about events durations.

event_id start_date end_date
1 2024-08-16 14:30:00 2024-08-16 16:00:00
1 2024-08-16 16:00:00 2024-08-16 17:30:00
1 2024-08-16 17:30:00 2024-08-16 19:00:00
1 2024-08-16 19:00:00 2024-08-16 20:30:00
1 2024-08-16 20:30:00 2024-08-16 22:00:00
1 2024-08-16 22:00:00 2024-08-16 23:30:00

I want to group them but max duration of 1 row should be 4 hours.

Result:

event_id start_date end_date
1 2024-08-16 14:30:00 2024-08-16 17:30:00
1 2024-08-16 17:30:00 2024-08-16 20:30:00
1 2024-08-16 20:30:00 2024-08-16 23:30:00

I tried to use something like this to display an increased duration but no idea how to use it for grouping

select sum(end_date - start_date) over (partition by event_id order by start_date)

Upvotes: 0

Views: 58

Answers (2)

marcothesane
marcothesane

Reputation: 6721

We have only one event id, so I have to guess what happens if you have several.

What I observe, in column start_date of the result set, is a time series with evenly spaced intervals of 4 hours, not starting at a multiple of four hours, but at the earliest timestamp per event id.

And in end_date of the result set, you have the start date of the subsequent row, if ordered by start_date, or, in its absence, that is, in the last row for the event_id, that row's end_date.

Vertica has the TIMESERIES clause, which is able to create an evenly spaced series of timestamps out of an irregular time series, with the spacing interval freely choosable.

The resulting time series, however, starts at a multiple of the chosen interval. In your case, at '2024-08-16 12:00:00' instead of '2024-08-16 14:30:00', that is, at the beginning of four hour time slice that the input timestamp belongs to.

Let's calculate the offset between those two timestamps to begin with, add the max end_date, and take it from there:

WITH                                                                                                                                                
-- your input ....
indata(event_id,start_date,end_date) AS (
          SELECT 1,TIMESTAMP '2024-08-16 14:30:00',TIMESTAMP '2024-08-16 16:00:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 16:00:00',TIMESTAMP '2024-08-16 17:30:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 17:30:00',TIMESTAMP '2024-08-16 19:00:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 19:00:00',TIMESTAMP '2024-08-16 20:30:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 20:30:00',TIMESTAMP '2024-08-16 22:00:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 22:00:00',TIMESTAMP '2024-08-16 23:30:00'
)
-- end of input, real query starts here, replace following comma with WITH .... 
,
off_set(event_id,o_start_dt,tslice,off_set,o_end_date) AS (
  SELECT
    event_id
  , MIN(start_date)
  , TIME_SLICE(MIN(start_date),4, 'HOUR')
  , MIN(start_date) - TIME_SLICE(MIN(start_date),4, 'HOUR')
  , MAX(end_date)
  FROM indata
  GROUP BY 1
-- CHK  event_id |     o_start_dt      |       tslice        | off_set |     o_end_date      
-- CHK ----------+---------------------+---------------------+---------+---------------------
-- CHK         1 | 2024-08-16 14:30:00 | 2024-08-16 12:00:00 | 02:30   | 2024-08-16 23:30:00
)
SELECT
  indata.event_id
 -- you can't reference columns directly except the PARTITION BY column,
 -- hence TS_LAST_VALUE() ... 
,       ts + TS_LAST_VALUE(off_set) AS start_date
, LEAST(
    ts + TS_LAST_VALUE(off_set) + '4 HOURS'::INTERVAL
  , TS_LAST_VALUE(o_end_date)
  ) AS end_date
FROM indata
JOIN off_set USING(event_id)
TIMESERIES ts AS '4 HOURS' OVER(PARTITION BY indata.event_id ORDER BY start_date)
-- out  event_id |     start_date      |      end_date       
-- out ----------+---------------------+---------------------
-- out         1 | 2024-08-16 14:30:00 | 2024-08-16 18:30:00
-- out         1 | 2024-08-16 18:30:00 | 2024-08-16 22:30:00
-- out         1 | 2024-08-16 22:30:00 | 2024-08-16 23:30:00

Upvotes: 1

keithwalsh
keithwalsh

Reputation: 814

  • MIN(start_date) OVER (PARTITION BY event_id) to find the earliest start_date for each event_id.
  • Use FLOOR to assign a group number based on the 4 hour interval.
  • MIN(start_date) and MAX(end_date) to find the start and end times for each group.
WITH grouping AS (
    SELECT 
        event_id, 
        start_date, 
        end_date,
        MIN(start_date) OVER (PARTITION BY event_id) AS first_start_date,
        FLOOR(
            TIMESTAMPDIFF('second', MIN(start_date) OVER (PARTITION BY event_id), start_date) 
            / (4 * 3600)
        ) AS group_num
    FROM event
)
SELECT
    event_id,
    MIN(start_date) AS start_date,
    MAX(end_date) AS end_date
FROM grouping
GROUP BY event_id, group_num
ORDER BY event_id, start_date;

Upvotes: 1

Related Questions