user9532692
user9532692

Reputation: 706

Get time difference between Log records

I have a log table that tracks the bug's status. I would like to extract the amount of time spent when the log changes from OPEN (OldStatus) to FIXED or REQUEST CLOSE (NewStatus). Right now, my query looks at the max and min of the log which does not produce the result I want. For example, the bug #1 was fixed in 2 hours on 2020-01-01, then reopened (OldStatus) and got a REQUEST CLOSE (NewStatus) in 3 hours on 2020-12-12. I want the query result to return two rows with date and number of hours spent to fix the bug since its most recently opened time.

Here's the data:

CREATE TABLE Log (
    BugID int,
    CurrentTime timestamp,
    Person varchar(20),
    OldStatus varchar(20),
    NewStatus varchar(20)
);

INSERT INTO Log (BugID, CurrentTime, Person, OldStatus, NewStatus)
VALUES (1, '2020-01-01 00:00:00', 'A', 'OPEN', 'In Progress'),
       (1, '2020-01-01 00:00:01', 'A', 'In Progress', 'REVIEW In Progress'),
       (1, '2020-01-01 02:00:00', 'A', 'In Progress', 'FIXED'),
       (1, '2020-01-01 06:00:00', 'B', 'OPEN', 'In Progress'),
       (1, '2020-01-01 00:00:00', 'B', 'In Progress', 'REQUEST CLOSE')

SELECT DATEDIFF(HOUR, start_time, finish_time) AS Time_Spent_Min
FROM (
    SELECT  BugId, 
            MAX(CurrentTime) as finish_time, 
            MIN(CurrentTime) as start_time
            FROM Log
            WHERE (OldStatus = 'OPEN' AND NewString = 'In Progress') OR NewString = 'FIXED' 
) AS TEMP

The actual data looks as below:

original data

FYI @Charlieface

enter image description here

Upvotes: 0

Views: 1594

Answers (3)

Charlieface
Charlieface

Reputation: 71253

This is a type of gaps-and-islands problem.

There are a number of solutions, here is one:

  • We need to assign a grouping ID to each island of OPEN -> In Progress. We can use windowed conditional COUNT to get a grouping number for each start point.
  • To get a grouping for the end point, we need to assign the previous row's NewStatus using LAG, then do another conditional COUNT on that.
  • We then simply group by BugId and our calculated grouping and return the start and end times
WITH IslandStart AS (
    SELECT *,
        COUNT(CASE WHEN OldStatus = 'OPEN' AND NewStatus = 'In Progress' THEN 1 END)
          OVER (PARTITION BY BugID ORDER BY CurrentTime ROWS UNBOUNDED PRECEDING) AS GroupStart,
        LAG(NewStatus) OVER (PARTITION BY BugID ORDER BY CurrentTime) AS Prev_NewStatus
    FROM Log l
),
IslandEnd AS (
    SELECT *,
        COUNT(CASE WHEN Prev_NewStatus IN ('CLAIM FIXED', 'REQUEST CLOSE') THEN 1 END)
          OVER (PARTITION BY BugID ORDER BY CurrentTime ROWS UNBOUNDED PRECEDING) AS GroupEnd
    FROM IslandStart l
)
SELECT
    BugId, 
    MAX(CurrentTime) as finish_time, 
    MIN(CurrentTime) as start_time,
    DATEDIFF(minute, MIN(CurrentTime), MAX(CurrentTime)) AS Time_Spent_Min
FROM IslandEnd l
WHERE GroupStart = GroupEnd + 1
GROUP BY
  BugId,
  GroupStart;

Notes:

  • timestamp is not meant for actual dates and times, instead use datetime or datetime2
  • You may need to adjust the COUNT condition if OPEN -> In Progress is not always the first row of an island

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269553

If I assume that every "open" is followed by one "fixed" before the next open, then you can basically use lead() to solve this problem.

This version unpivots the data, so you could have "open" and "fixed" in the same row:

select l.*, datediff(hour, currenttime, fixed_time)
from (select v.*,
             lead(v.currenttime) over (partition by v.bugid order by v.currenttime) as fixed_time
      from log l cross apply
           (values (bugid, currentTime, oldStatus),
                   (bugid, currentTime, newStatus)
           ) v(bugid, currentTime, status)
      where v.status in ('OPEN', 'FIXED')
     ) l
where status = 'OPEN';

Here is a db<>fiddle, which uses data compatible with your explanation. (Your sample data is not correct.)

Upvotes: 0

Chris Schaller
Chris Schaller

Reputation: 16554

You have a few competing factors here:

  1. You should use a SmallDateTime, DateTime2 or DateTimeOffset typed columns to store the actual time in the log, these types allow for calculating the differece between values using DateDiff() and DateAdd() and other date/time based comparison logic, where as Timestamp is designed to be used as a currency token, you can use it to determine if one record is more recent than another, you shouldn't try to use it to determine the actual time of the event.

  2. You have not explained the expected workflow, we can only assume that the flow is [OPEN]=>[In Progress]=>[CLAIM FIXED]. There is also no mention of 'In Progress', which we assume is an interim state. What actually happens here is that this structure can really only tell you the time spent in the 'In Progress' state, which is probably OK for your needs as this is the time spent actually working, but it is important to recognise that we do not know when the bug is changed to 'OPEN' in the first place, unless that is also logged but we need to see the data to explain that.

  3. Your example dataset does not cover enough combinations for you to notice that the existing logic will fail as soon as you add more than 1 bug. What is more you have asked to calculate the number of hours, but your example data only shows a variation minutes and has no example where the bug is completed at all.

    • Without a realistic set of data to test with, you will find it hard to debug your logic and hard to accept that it actually works before you execute this against a larger dataaset. It can help to have a scripted scenario, much like your post here, but you should create the data to reflect that script.
    • You use 'FIXED' in your example, but 'CLAIM FIXED' in query, so which one is it?

Step 1: Structure

Change the datatype of CurrentTime to a DateTime based column. Your application logic may drive requirements here. If your system is cloud based or international, then you may see benefits from using DateTimeOffset instead of having to convert into UTC, otherwise if you do not need high precision timing in your logs, it is very common to use SmallDateTime for logging.

  • Many ORM and application frameworks will allow you to configure a DateTime based column as the concurrency token, it you need one at all. If you are not happy using a lower precision value for concurrency, then you could have the two columns side by side, to compare the time difference between two records, we need to use a DateTime based type.
  • In the case of log, we rarely allow or expect logs to be edited, if your logs are read-only then having a concurrency token at all may not be necessary, especially if you only use the concurrency token to determine concurrency during edits of individual records.

NOTE: You should consider using an enum or FK for the Status concept. Already in your example dataset there was a typo for 'In Progerss', using a numeric comparison for the status may provide some performance benefits but it will help to prevent spelling mistakes, especially when FK or lookup lists are used from any application logic.

Step 2: Example Data

If the requirement is to calculate the number of hours spent between records, then we need to create some simple examples that show a difference of a few hours, and then add some examples where the same bug is opened, fixed and then re-opened.

  • bug #1 was fixed in 2 hours on 2020-01-01, then reopened and got fixed in 3 hours on 2020-12-12

The following table shows the known data states and the expected hrs, we need to throw in a few more data stories to validate that the end query handles obvious boundary conditions like multiple Bugs and overlapping dates

BUG # Time Previous State New State Hrs In Progress
1 2020-01-01 08:00:00 OPEN In Progress
1 2020-01-01 10:00:00 In Progress FIXED (2 hrs)
1 2020-12-10 09:00:00 FIXED OPEN
1 2020-12-12 9:30:00 OPEN In Progress
1 2020-12-12 12:30:00 In Progress FIXED (3 hrs)
2 2020-03-17 11:15:00 OPEN In Progress
2 2020-03-17 14:30:00 In Progress FIXED (3.25 hrs)
3 2020-08-22 10:00:00 OPEN In Progress
3 2020-08-22 16:30:00 In Progress FIXED (6.5 hrs)

Step 3: Query

What is interesting to notice here is that 'In Progress' is actually the significant state to query against. What we actually want is to see all rows where the OldStatus is 'In Progress' and we want to link that row to the most recent record before this one with the same BugID and with a NewStatus equal to 'In Progress'

What is interesting in the above table is that not all the expected hours are whole numbers (integers) which makes using DateDiff a little bit tricky because it only counts the boundary changes, not the total number of hours. to highlight this, look at the next two queries, the first one represents 59 minutes, the other only 2 minutes:

SELECT DateDiff(HOUR, '2020-01-01 08:00:00', '2020-01-01 08:59:00') -- 0 (59m)
SELECT DateDiff(HOUR, '2020-01-01 08:59:00', '2020-01-01 09:01:00') -- 1 (1m)

However the SQL results show the first query as 0 hours, but the second query returns 1 hour. That is because it only compares the HOUR column, it is not actually doing a subtraction of the time value at all.

To work around this, we can use MINUTE or MI as the date part argument and divide the result by 60.

SELECT CAST(ROUND(DateDiff(MI, '2020-01-01 08:00:00', '2020-01-01 08:59:00')/60.0,2) as Numeric(10,2)) -- 0.98
SELECT CAST(ROUND(DateDiff(MI, '2020-01-01 08:59:00', '2020-01-01 09:01:00')/60.0,2) as Numeric(10,2)) -- 0.03

You can choose to format this in other ways by calculating the modulo to get the minutes in whole numbers instead of a fraction but that is out of scope for this post, understanding the limitations of DateDiff is what is important to take this further.

There are a number of ways to correlate a previous record within the same table, if you need other values form the record then you might use a join with a sub-query to return the TOP 1 from all the records before the current one, you could use window queries or a CROSS APPLY to perform a nested lookup. The following uses CROSS APPLY which is NOT standard across all RDBMS but I feel it keeps MS SQL queries really clean:

SELECT [Fixed].BugID, [start_time], [Fixed].[CurrentTime] as [finish_time]
     , DATEDIFF(MI, [start_time], [Fixed].[CurrentTime]) / 60 AS Time_Spent_Hr 
     , DATEDIFF(MI, [start_time], [Fixed].[CurrentTime]) % 60 AS Time_Spent_Min 
FROM Log as Fixed
CROSS APPLY (SELECT MAX(CurrentTime) AS start_time 
             FROM Log as Started
             WHERE Fixed.BugID = Started.BugID 
               AND Started.NewStatus = 'In Progress'
               AND CurrentTime < Fixed.CurrentTime) as Started
WHERE Fixed.OldStatus = 'In Progress'

You can play with this fiddle: http://sqlfiddle.com/#!18/c408d4/3 However the results show this:

BugID start_time finish_time Time_Spent_Hr Time_Spent_Min
1 2020-01-01T08:00:00Z 2020-01-01T10:00:00Z 2 0
1 2020-12-12T09:30:00Z 2020-12-12T12:30:00Z 3 0
2 2020-03-17T11:15:00Z 2020-03-17T14:30:00Z 3 15
3 2020-08-22T10:00:00Z 2020-08-22T16:30:00Z 6 30

Upvotes: 2

Related Questions