Reputation: 594
How to group records by interval 4 hours?
For example, I have rows that contain data about events durations.
event_id | start_date | end_date |
---|---|---|
1 | 2024-08-16 14:30:00 | 2024-08-16 16:00:00 |
1 | 2024-08-16 16:00:00 | 2024-08-16 17:30:00 |
1 | 2024-08-16 17:30:00 | 2024-08-16 19:00:00 |
1 | 2024-08-16 19:00:00 | 2024-08-16 20:30:00 |
1 | 2024-08-16 20:30:00 | 2024-08-16 22:00:00 |
1 | 2024-08-16 22:00:00 | 2024-08-16 23:30:00 |
I want to group them but max duration of 1 row should be 4 hours.
Result:
event_id | start_date | end_date |
---|---|---|
1 | 2024-08-16 14:30:00 | 2024-08-16 17:30:00 |
1 | 2024-08-16 17:30:00 | 2024-08-16 20:30:00 |
1 | 2024-08-16 20:30:00 | 2024-08-16 23:30:00 |
I tried to use something like this to display an increased duration but no idea how to use it for grouping
select sum(end_date - start_date) over (partition by event_id order by start_date)
Upvotes: 0
Views: 58
Reputation: 6721
We have only one event id, so I have to guess what happens if you have several.
What I observe, in column start_date
of the result set, is a time series with evenly spaced intervals of 4 hours, not starting at a multiple of four hours, but at the earliest timestamp per event id.
And in end_date
of the result set, you have the start date of the subsequent row, if ordered by start_date
, or, in its absence, that is, in the last row for the event_id
, that row's end_date
.
Vertica has the TIMESERIES
clause, which is able to create an evenly spaced series of timestamps out of an irregular time series, with the spacing interval freely choosable.
The resulting time series, however, starts at a multiple of the chosen interval. In your case, at '2024-08-16 12:00:00' instead of '2024-08-16 14:30:00', that is, at the beginning of four hour time slice that the input timestamp belongs to.
Let's calculate the offset between those two timestamps to begin with, add the max end_date
, and take it from there:
WITH
-- your input ....
indata(event_id,start_date,end_date) AS (
SELECT 1,TIMESTAMP '2024-08-16 14:30:00',TIMESTAMP '2024-08-16 16:00:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 16:00:00',TIMESTAMP '2024-08-16 17:30:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 17:30:00',TIMESTAMP '2024-08-16 19:00:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 19:00:00',TIMESTAMP '2024-08-16 20:30:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 20:30:00',TIMESTAMP '2024-08-16 22:00:00'
UNION ALL SELECT 1,TIMESTAMP '2024-08-16 22:00:00',TIMESTAMP '2024-08-16 23:30:00'
)
-- end of input, real query starts here, replace following comma with WITH ....
,
off_set(event_id,o_start_dt,tslice,off_set,o_end_date) AS (
SELECT
event_id
, MIN(start_date)
, TIME_SLICE(MIN(start_date),4, 'HOUR')
, MIN(start_date) - TIME_SLICE(MIN(start_date),4, 'HOUR')
, MAX(end_date)
FROM indata
GROUP BY 1
-- CHK event_id | o_start_dt | tslice | off_set | o_end_date
-- CHK ----------+---------------------+---------------------+---------+---------------------
-- CHK 1 | 2024-08-16 14:30:00 | 2024-08-16 12:00:00 | 02:30 | 2024-08-16 23:30:00
)
SELECT
indata.event_id
-- you can't reference columns directly except the PARTITION BY column,
-- hence TS_LAST_VALUE() ...
, ts + TS_LAST_VALUE(off_set) AS start_date
, LEAST(
ts + TS_LAST_VALUE(off_set) + '4 HOURS'::INTERVAL
, TS_LAST_VALUE(o_end_date)
) AS end_date
FROM indata
JOIN off_set USING(event_id)
TIMESERIES ts AS '4 HOURS' OVER(PARTITION BY indata.event_id ORDER BY start_date)
-- out event_id | start_date | end_date
-- out ----------+---------------------+---------------------
-- out 1 | 2024-08-16 14:30:00 | 2024-08-16 18:30:00
-- out 1 | 2024-08-16 18:30:00 | 2024-08-16 22:30:00
-- out 1 | 2024-08-16 22:30:00 | 2024-08-16 23:30:00
Upvotes: 1
Reputation: 814
MIN(start_date) OVER (PARTITION BY event_id)
to find the earliest start_date
for each event_id
.FLOOR
to assign a group number based on the 4 hour interval.MIN(start_date)
and MAX(end_date)
to find the start and end times for each group.WITH grouping AS (
SELECT
event_id,
start_date,
end_date,
MIN(start_date) OVER (PARTITION BY event_id) AS first_start_date,
FLOOR(
TIMESTAMPDIFF('second', MIN(start_date) OVER (PARTITION BY event_id), start_date)
/ (4 * 3600)
) AS group_num
FROM event
)
SELECT
event_id,
MIN(start_date) AS start_date,
MAX(end_date) AS end_date
FROM grouping
GROUP BY event_id, group_num
ORDER BY event_id, start_date;
Upvotes: 1