Brian
Brian

Reputation: 91

SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value

In our company, our clients perform various activities that we log in different tables - Interview attendance, Course Attendance, and other general activities. I have a database view that unions data from all of these tables giving us the ActivityView that looks like this. As you can see some activities overlap - for example while attending an interview, a client may have been performing a CV update activity.

+----------------------+---------------+---------------------+-------------------+
| activity_client_id   | activity_type | activity_start_date | activity_end_date |
+----------------------+---------------+---------------------+-------------------+
|                  112 | Interview     | 2015-06-01 09:00    | 2015-06-01 11:00  |
|                  112 | CV updating   | 2015-06-01 09:30    | 2015-06-01 11:30  |
|                  112 | Course        | 2015-06-02 09:00    | 2015-06-02 16:00  |
|                  112 | Interview     | 2015-06-03 09:00    | 2015-06-03 10:00  |
+----------------------+---------------+---------------------+-------------------+

Each client has a "Sign Up Date", recorded on the client table, which is when they joined our programme. Here it is for our sample client:

+-----------+---------------------+
| client_id | client_sign_up_date |
+-----------+---------------------+
|       112 | 2015-05-20          |
+-----------+---------------------+

I need to create a report that will show the following columns:

+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+

We need this report in order to see how effective our programme is. An important aim of the programme is that we get every client to complete at least 5 hours of activity as quickly as possible. So this report will tell us how long from sign up does it take each client to achieve this figure.

What makes this even trickier is that when we calculate 5 hours of total activity, we must discount overlapping activities:

In the sample data above the client attended an interview between 09:00 and 11:00.
On the same day they also performed CV updating activity from 09:30 to 11:30. For our calculation, this would give them total activity for the day of 2.5 hours (150 minutes) - we would only count 30 minutes of the CV updating as the Interview overlaps it up to 11:00.

So the report for our sample client would give the following result:

+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
|       112 | 2015-05-20          | 2015-06-02                                 |
+-----------+---------------------+--------------------------------------------+

So my question is how can I create the report using a select statement ? I can work out how to do this by writing a stored procedure that will loop through the view and write the result to a report table. But I would much prefer to avoid a stored procedure and have a select statement that will give me the report on the fly.

I am using SQL Server 2005.

Upvotes: 5

Views: 1341

Answers (3)

pwilcox
pwilcox

Reputation: 5763

A Geometric Approach

For another issue, I've taken a geometric approach to date packing. Namely, I convert dates and times to a sql geometry type and utilize geometry::UnionAggregate to merge the ranges.

I don't believe this will work in sql-server 2005. But your problem was such an interesting puzzle that I wanted to see whether the geometrical approach would work. So any future users running into this problem that have access to a later version can consider it.

Code Description

In 'numbers':

  • I build a table representing a sequence
  • Swap it out with your favorite way to make a numbers table.
  • For a union operation, you won't ever need more rows than in your original table, so I just use it as the base to build it.

In 'mergeLines':

  • I convert the dates to floats and use those floats to create geometrical points.
  • I then connect these points via STUnion and STEnvelope.
  • Finally, I merge all these lines via UnionAggregate. The resulting 'lines' geometry object might contain multiple lines, but if they overlap, they turn into one line.

In 'redate':

  • I use the numbers CTE to extract the individual lines inside 'lines'.
  • I envelope the lines which here ensures that the lines are stored only as its two endpoints.
  • I read the endpoint x values and convert them back to their time representations (This is usually the end goal, but you need more).
  • I calculate the difference in minutes between activity start and end dates (I do this first in seconds then divide by 60 for the sake of a precision issue).
  • I calculate the cumulative sume of these minutes for each row.

In the outer query:

  • I align the previous cumulative minutes sum with each current row
  • I filter for the row where the 5hr goal was met but where the previous minutes shows that the 5hr goal for the previous row was not met.
  • I then calculate where in the current row's range the user has met the 5 hours, to not only arrive at the date the five hour goal was met, but the exact time.

The Code

with

    numbers as (

        select  row_number() over (order by (select null)) i 
        from    @activities -- where I put your data

    ),

    mergeLines as (

        select      activity_client_id,
                    lines = geometry::UnionAggregate(line)
        from        @activities
        cross apply (select 
                        startP = geometry::Point(convert(float,activity_start_date), 0, 0),
                        stopP = geometry::Point(convert(float,activity_end_date), 0, 0)
                    ) pointify
        cross apply (select line = startP.STUnion(stopP).STEnvelope()) lineify
        group by    activity_client_id

    ),

    redate as (

        select      client_id = activity_client_id, 
                    activities_start_date,
                    activities_end_date,
                    minutes,

                    rollingMinutes = sum(minutes) over(
                        partition by activity_client_id 
                        order by activities_start_date 
                        rows between unbounded preceding and current row
                    )

        from        mergeLines ml
        join        numbers n on n.i between 1 and ml.lines.STNumGeometries()
        cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
        cross apply (select 
                        activities_start_date = convert(datetime, l.line.STPointN(1).STX),
                        activities_end_date = convert(datetime, l.line.STPointN(3).STX)
                    ) unprepare
        cross apply (select minutes = 
                        round(datediff(s, activities_start_date, activities_end_date) / 60.0,0)
                    ) duration

    )

    select      client_id,
                activities_start_date,
                activities_end_date,
                met_5hr_goal = dateadd(minute, (60 * 5) - prevRoll, activities_start_date) 
    from        (
                    select  *,
                            prevRoll = lag(rollingMinutes) over (
                                partition by client_id 
                                order by rollingMinutes
                            )
                    from    redate 
                ) ranker
    where       rollingMinutes >= 60 * 5
    and         prevRoll < 60 * 5;

Upvotes: 0

Mitan Shah
Mitan Shah

Reputation: 344

See SQL Fiddle here.

with tbl as (
  -- this will generate daily merged ovelaping time
  select distinct
    a.id
    ,(
        select min(x.starttime) 
        from act x 
        where x.id=a.id and ( x.starttime between a.starttime and a.endtime
          or a.starttime between x.starttime and x.endtime )
    ) start1
    ,(
        select max(x.endtime) 
        from act x 
        where x.id=a.id and ( x.endtime between a.starttime and a.endtime
          or a.endtime between x.starttime and x.endtime )
    ) end1
  from act a

), tbl2 as 
(
  -- this will add minute and total minute column
  select 
    * 
    ,datediff(mi,t.start1,t.end1) mi
    ,(select sum(datediff(mi,x.start1,x.end1)) from tbl x where x.id=t.id and x.end1<=t.end1) totalmi
  from tbl t
), tbl3 as 
(
  -- now final query showing starttime and endtime for 5 hours other wise null in case not completed 5(300 minutes) hours
  select 
    t.id
    ,min(t.start1) starttime
    ,min(case when t.totalmi>300 then t.end1 else null end) endtime
  from tbl2 t
  group by t.id
)
-- final result 
select *
from tbl3
where endtime is not null

Upvotes: 2

Giorgos Betsos
Giorgos Betsos

Reputation: 72225

This is one way to do it:

;WITH CTErn AS (
   SELECT activity_client_id, activity_type,
          activity_start_date, activity_end_date,
          ROW_NUMBER() OVER (PARTITION BY activity_client_id 
                             ORDER BY activity_start_date) AS rn
   FROM activities
),   
CTEdiff AS (
   SELECT c1.activity_client_id, c1.activity_type,
          x.activity_start_date, c1.activity_end_date,
          DATEDIFF(mi, x.activity_start_date, c1.activity_end_date) AS diff,
          ROW_NUMBER() OVER (PARTITION BY c1.activity_client_id 
                             ORDER BY x.activity_start_date) AS seq
   FROM CTErn AS c1
   LEFT JOIN CTErn AS c2 ON c1.rn = c2.rn + 1
   CROSS APPLY (SELECT CASE 
                          WHEN c1.activity_start_date < c2.activity_end_date
                             THEN c2.activity_end_date
                          ELSE c1.activity_start_date
                       END) x(activity_start_date)    
)
SELECT TOP 1 client_id, client_sign_up_date, activity_start_date, 
             hoursOfActivicty               
FROM CTEdiff AS c1
INNER JOIN clients AS c2 ON c1.activity_client_id = c2.client_id                     
CROSS APPLY (SELECT SUM(diff) / 60.0
             FROM CTEdiff AS c3
             WHERE c3.seq <= c1.seq) x(hoursOfActivicty)
WHERE hoursOfActivicty >= 5
ORDER BY seq

Common Table Expressions and ROW_NUMBER() were introduced with SQL Server 2005, so the above query should work for that version.

Demo here

The first CTE, i.e. CTErn, produces the following output:

client_id   activity_type   start_date          end_date          rn
112         Interview       2015-06-01 09:00    2015-06-01 11:00  1
112         CV updating     2015-06-01 09:30    2015-06-01 11:30  2
112         Course          2015-06-02 09:00    2015-06-02 16:00  3
112         Interview       2015-06-03 09:00    2015-06-03 10:00  4

The second CTE, i.e. CTEdiff, uses the above table expression in order to calculate time difference for each record, taking into consideration any overlapps with the previous record:

client_id activity_type start_date       end_date         diff  seq
112       Interview     2015-06-01 09:00 2015-06-01 11:00 120   1
112       CV updating   2015-06-01 11:00 2015-06-01 11:30 30    2
112       Course        2015-06-02 09:00 2015-06-02 16:00 420   3
112       Interview     2015-06-03 09:00 2015-06-03 10:00 60    4

The final query calculates the cumulative sum of time difference and selects the first record that exceeds 5 hours of activity.

The above query will work for simple interval overlaps, i.e. when just the end date of an activity overlaps the start date of the next activity.

Upvotes: 1

Related Questions