Reputation: 103
I'm using Microsoft SQL Server 2016.
There is a database table "Raw_data" that contains the status of a machine together with its starting time. There are several machines, and each one writes its status to the database multiple times per minute.
To reduce the data volume, I'm trying to aggregate the data into 1-minute chunks and save those for further analysis. Due to a capacity constraint, I want to execute this transition logic every few minutes (e.g. as a scheduled SQL Server Agent job), delete the raw data and keep only the aggregated data.
To simplify the example, let's assume "Raw_data" looks something like this:
╔════╦════════════╦════════╦═════════════════════╗
║ id ║ fk_machine ║ status ║ created_at ║
╠════╬════════════╬════════╬═════════════════════╣
║ 1 ║ 2222 ║ 0 ║ 2020-08-19 22:15:00 ║
║ 2 ║ 2222 ║ 3 ║ 2020-08-19 22:15:30 ║
║ 3 ║ 2222 ║ 5 ║ 2020-08-19 23:07:00 ║
║ 4 ║ 2222 ║ 1 ║ 2020-08-20 00:20:00 ║
║ 5 ║ 2222 ║ 0 ║ 2020-08-20 00:45:00 ║
║ 6 ║ 2222 ║ 5 ║ 2020-08-20 02:20:00 ║
╚════╩════════════╩════════╩═════════════════════╝
There are also the database tables "Dim_date" and "Dim_time", which look something like this:
╔══════════╦══════════════╗
║ datekey ║ date_iso8601 ║
╠══════════╬══════════════╣
║ 20200101 ║ 2020-01-01 ║
║ 20200102 ║ 2020-01-02 ║
║ ... ║ ... ║
║ 20351231 ║ 2035-12-31 ║
╚══════════╩══════════════╝
╔═════════╦══════════╦═════════════════╗
║ timekey ║ time_iso ║ min_lower_bound ║
╠═════════╬══════════╬═════════════════╣
║ 1 ║ 00:00:01 ║ 00:00:00 ║
║ 2 ║ 00:00:02 ║ 00:00:00 ║
║ ... ║ ... ║ ... ║
║ 80345 ║ 08:03:45 ║ 08:03:00 ║
║ ... ║ ... ║ ... ║
║ 134504 ║ 13:45:04 ║ 13:45:00 ║
║ 134505 ║ 13:45:05 ║ 13:45:00 ║
║ ... ║ ... ║ ... ║
║ 235959 ║ 23:59:59 ║ 23:59:00 ║
╚═════════╩══════════╩═════════════════╝
The result should look like this:
╔══════════════╦═════════════════╦════════════╦════════╦═══════════════╗
║ date_iso8601 ║ min_lower_bound ║ fk_machine ║ status ║ total_seconds ║
╠══════════════╬═════════════════╬════════════╬════════╬═══════════════╣
║ 2020-08-19 ║ 22:15:00 ║ 2222 ║ 0 ║ 30 ║
║ 2020-08-19 ║ 22:15:00 ║ 2222 ║ 3 ║ 30 ║
║ 2020-08-19 ║ 22:16:00 ║ 2222 ║ 3 ║ 60 ║
║ 2020-08-19 ║ 22:17:00 ║ 2222 ║ 3 ║ 60 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 2020-08-19 ║ 23:06:00 ║ 2222 ║ 3 ║ 60 ║
║ 2020-08-19 ║ 23:07:00 ║ 2222 ║ 5 ║ 60 ║
║ 2020-08-19 ║ 23:08:00 ║ 2222 ║ 5 ║ 60 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 2020-08-20 ║ 00:19:00 ║ 2222 ║ 5 ║ 60 ║
║ 2020-08-20 ║ 00:20:00 ║ 2222 ║ 1 ║ 60 ║
║ 2020-08-20 ║ 00:21:00 ║ 2222 ║ 1 ║ 60 ║
║ ... ║ ... ║ ... ║ ... ║ ... ║
║ 2020-08-20 ║ 00:44:00 ║ 2222 ║ 1 ║ 60 ║
║ 2020-08-20 ║ 00:45:00 ║ 2222 ║ 0 ║ 60 ║
╚══════════════╩═════════════════╩════════════╩════════╩═══════════════╝
To calculate the duration of each status per minute, I used a CTE and LEAD to fetch the starting date and time of the next status in the database table, then joined with the dimension tables and aggregated the result.
WITH CTE_MACHINE_STATES(START_DATEKEY,
START_TIMEKEY,
FK_MACHINE,
STATUS,
END_DATEKEY,
END_TIMEKEY)
AS (SELECT CAST(CONVERT(CHAR(8), CREATED_AT, 112) AS INT), -- ISO: yyyymmdd
CONVERT(INT, REPLACE(CONVERT(CHAR(8), CREATED_AT, 108), ':', '')), -- hh:mi:ss -> hhmiss
FK_MACHINE,
STATUS,
CAST(CONVERT(CHAR(8), LEAD(CREATED_AT, 1) OVER(PARTITION BY FK_MACHINE
ORDER BY CREATED_AT), 112) AS INT),
CONVERT(INT, REPLACE(CONVERT(CHAR(8), LEAD(CREATED_AT, 1) OVER(PARTITION BY FK_MACHINE
ORDER BY CREATED_AT), 108), ':', ''))
FROM RAW_DATA)
SELECT DATE_ISO8601,
MIN_LOWER_BOUND,
FK_MACHINE,
STATUS,
SUM(1) AS TOTAL_SECONDS -- Duration
FROM CTE_MACHINE_STATES
CROSS JOIN DIM_DATE
CROSS JOIN DIM_TIME
WHERE TIMEKEY >= START_TIMEKEY AND
TIMEKEY < END_TIMEKEY AND
END_TIMEKEY IS NOT NULL AND -- last entry per machine and status
DATEKEY BETWEEN START_DATEKEY AND END_DATEKEY
GROUP BY FK_MACHINE,
STATUS,
DATE_ISO8601,
MIN_LOWER_BOUND
ORDER BY DATE_ISO8601,
MIN_LOWER_BOUND;
If a status lasts past midnight, it isn't aggregated correctly. For example, the status at id = 3 in "Raw_data" starts at 23:07 and ends at 00:20 the next day. Here, timekey is greater than end_timekey, so the status gets excluded from the result by the filter TIMEKEY < END_TIMEKEY. I haven't come up with a way to change the join condition so that it includes such long-lasting states and still returns the expected result.
PS: As mentioned above, status updates normally arrive every few seconds, so the problem only occurs in edge cases, e.g. if a machine gets turned off.
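To make the failure concrete, here is a minimal Python sketch (not part of the original query) that reproduces only the time filter. It assumes the timekey is the integer hhmmss, as in "Dim_time", and it enumerates every second of a day, including 00:00:00. The function names are made up for illustration.

```python
from datetime import datetime

def timekey(ts: datetime) -> int:
    """hhmmss as an integer, like the timekey column in Dim_time."""
    return ts.hour * 10000 + ts.minute * 100 + ts.second

def matching_seconds(start: datetime, end: datetime) -> int:
    """Count the seconds of a day that pass
    TIMEKEY >= START_TIMEKEY AND TIMEKEY < END_TIMEKEY."""
    lo, hi = timekey(start), timekey(end)
    count = 0
    for h in range(24):
        for m in range(60):
            for s in range(60):
                if lo <= h * 10000 + m * 100 + s < hi:
                    count += 1
    return count

# Status with id = 2: 22:15:30 until 23:07:00 on the same day.
same_day = matching_seconds(datetime(2020, 8, 19, 22, 15, 30),
                            datetime(2020, 8, 19, 23, 7, 0))
# Status with id = 3: 23:07:00 until 00:20:00 on the next day.
over_midnight = matching_seconds(datetime(2020, 8, 19, 23, 7, 0),
                                 datetime(2020, 8, 20, 0, 20, 0))

print(same_day)       # 3090, the full duration in seconds
print(over_midnight)  # 0, although the status lasted 4380 seconds
```

For the second status the lower bound is 230700 and the upper bound is 2000, so no timekey can satisfy both comparisons, whatever the date filter does.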
Unfortunately I did not receive an answer on how to get the expected result using the date- and time dimension tables. But dnoeth's approach using a recursive CTE is good, so I went with it:
WITH cte_outer AS (
SELECT fk_machine,
status,
created_at,
DATEADD(minute, DATEDIFF(minute, '2000', created_at), '2000') AS min_lower_bound, --truncates seconds from start time
LEAD(created_at) OVER(PARTITION BY fk_machine ORDER BY created_at) AS end_time
FROM raw_data
),
cte_recursive AS (
SELECT fk_machine,
status,
min_lower_bound,
end_time,
CASE
WHEN end_time > DATEADD(minute, 1, min_lower_bound)
THEN DATEDIFF(s, created_at, DATEADD(minute, 1, min_lower_bound))
ELSE DATEDIFF(s, created_at, end_time)
END AS total_seconds
FROM cte_outer
UNION ALL
SELECT fk_machine,
status,
DATEADD(minute, 1, min_lower_bound), -- next time segment (minute)
end_time,
CASE
WHEN end_time >= DATEADD(minute, 2, min_lower_bound)
THEN 60
ELSE DATEDIFF(s, DATEADD(minute, 1, min_lower_bound), end_time)
END
FROM cte_recursive
WHERE end_time > DATEADD(minute, 1, min_lower_bound)
)
SELECT min_lower_bound,
fk_machine,
status,
total_seconds
FROM cte_recursive
ORDER BY fk_machine,
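min_lower_bound
OPTION (MAXRECURSION 0); -- lifts the default limit of 100 recursion levels, which a status lasting longer than about 100 minutes would exceed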
Upvotes: 3
Views: 519
Reputation: 60472
This is a use case for a recursive CTE that advances start_minute (created_at truncated to the minute) by one minute per recursion:
with cte as
(
select fk_machine
,status
,start_minute
,end_time
,case
when end_time > dateadd(minute, 1,start_minute)
then datediff(s, created_at, dateadd(minute, 1,start_minute))
else datediff(s, created_at, end_time )
end as seconds
from
(
select fk_machine
,status
,created_at
,dateadd(minute, datediff(minute, 0, created_at), 0) as start_minute
,lead(created_at)
over (PARTITION BY fk_machine
order by created_at) as end_time
from tab
) as dt
union all
select fk_machine
,status
,dateadd(minute, 1,start_minute)
,end_time
,case
when end_time >= dateadd(minute, 2,start_minute)
then 60
else datediff(s, dateadd(minute, 1,start_minute), end_time)
end
from cte
where end_time > dateadd(minute, 1,start_minute)
)
select * from cte
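order by 1,3,4
option (maxrecursion 0); -- lifts the default limit of 100 recursion levels, needed once a status spans more than about 100 minutes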
See fiddle
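For readers who want to trace the logic outside SQL, here is a rough Python sketch of the same idea: pair each row with the next row of the same machine (the LEAD step), then walk forward one minute at a time. It is an illustration only, not the query above. It assumes whole-second timestamps, and unlike the CTE it skips the last row per machine instead of returning it with NULL seconds. The name split_into_minutes is made up.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

ONE_MINUTE = timedelta(minutes=1)

def split_into_minutes(rows):
    """rows: iterable of (fk_machine, status, created_at) tuples.
    Yields (fk_machine, start_minute, status, seconds) per minute bucket."""
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for machine, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # Equivalent of LEAD(created_at): pair each row with its successor.
        for (_, status, created_at), (_, _, end_time) in zip(group, group[1:]):
            start_minute = created_at.replace(second=0, microsecond=0)
            cursor = created_at
            while cursor < end_time:
                # A bucket ends at the next full minute or at the status end.
                bucket_end = min(start_minute + ONE_MINUTE, end_time)
                seconds = int((bucket_end - cursor).total_seconds())
                yield machine, start_minute, status, seconds
                start_minute += ONE_MINUTE
                cursor = start_minute

sample = [
    (2222, 0, datetime(2020, 8, 19, 22, 15, 0)),
    (2222, 3, datetime(2020, 8, 19, 22, 15, 30)),
    (2222, 5, datetime(2020, 8, 19, 23, 7, 0)),
    (2222, 1, datetime(2020, 8, 20, 0, 20, 0)),
]
for row in list(split_into_minutes(sample))[:3]:
    print(row)
```

The status that starts at 23:07 and ends at 00:20 needs no special case here, because the loop works on full datetimes rather than on separate date and time keys.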
Upvotes: 2
Reputation: 143
For something like this, concatenating the keys into a single datetime isn’t as costly as it might seem. You can then call DATEDIFF() and check for positive, negative, or absolute values in the comparison. I’ve run something similar, translating instantaneous data to minute aggregates across multiple decades, and DATEDIFF really makes the difference. However, this would perform much better if you simply pulled the raw data and did the calculations in a language with a good datetime library. SQL is always the answer until it isn’t.
What’s likely causing one of the problems here is the following statement:
WHERE TIMEKEY >= START_TIMEKEY AND
TIMEKEY < END_TIMEKEY AND
END_TIMEKEY IS NOT NULL AND
DATEKEY BETWEEN START_DATEKEY AND END_DATEKEY
If the date and time aren’t separated, you can say:
WHERE DateTimeKey >= START_DateTimeKey AND
DateTimeKey < END_DateTimeKey AND
END_DateTimeKey IS NOT NULL
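As a small illustration of why the combined key helps, the following Python sketch builds one datetime from a yyyymmdd datekey and an hhmmss timekey and compares it with the time-only check. The helper name to_datetime is hypothetical, and the key formats are assumed to match the dimension tables in the question.

```python
from datetime import datetime

def to_datetime(datekey: int, timekey: int) -> datetime:
    """Combine a yyyymmdd datekey and an hhmmss timekey into one datetime."""
    return datetime.strptime(f"{datekey:08d}{timekey:06d}", "%Y%m%d%H%M%S")

start = to_datetime(20200819, 230700)
end = to_datetime(20200820, 2000)       # 00:20:00 on the next day
probe = to_datetime(20200819, 233000)   # a second inside the status

print(start <= probe < end)             # True: the combined key orders correctly
print(230700 <= 233000 < 2000)          # False: the time-only comparison fails
```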
If you are trying to aggregate by a time value, it would help to eliminate the timekey table, which may be another source of problems. It may be a good idea to replace the timekey table with a recursion and a period duration. You will also need to account for these conditions:
The end time of the event must always be after the start time of the aggregate period:
DateDiff(second, Period_Start_Time, Event_End) > 0
The start time of the event must always be before the end time of the aggregate period:
DateDiff(second, Period_Start_Time, Event_Start) < @Period_Duration
There are several ways to distribute the event data across the periods, but datediff helps with linear distribution as well.
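The two conditions amount to an interval-overlap test. A minimal Python sketch of the linear distribution, assuming half-open intervals and whole-second timestamps (the function and parameter names are invented for this example):

```python
from datetime import datetime, timedelta

def seconds_in_period(event_start, event_end, period_start, period_seconds=60):
    """Seconds of [event_start, event_end) that fall into the period
    starting at period_start and lasting period_seconds."""
    period_end = period_start + timedelta(seconds=period_seconds)
    # The event must end after the period starts and start before the
    # period ends; otherwise the two intervals do not overlap.
    if not (event_end > period_start and event_start < period_end):
        return 0
    overlap = min(event_end, period_end) - max(event_start, period_start)
    return int(overlap.total_seconds())

event_start = datetime(2020, 8, 19, 23, 7, 0)
event_end = datetime(2020, 8, 20, 0, 20, 0)

print(seconds_in_period(event_start, event_end, datetime(2020, 8, 19, 23, 59, 0)))  # 60
print(seconds_in_period(event_start, event_end, datetime(2020, 8, 20, 0, 20, 0)))   # 0
print(seconds_in_period(datetime(2020, 8, 19, 22, 15, 30),
                        datetime(2020, 8, 19, 23, 7, 0),
                        datetime(2020, 8, 19, 22, 15, 0)))                          # 30
```

Because everything is computed on full datetimes, a period just before midnight is handled like any other.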
Upvotes: 2