Reputation: 149
In SQL Server 2014 I have a Periods table that looks like the following:
| PeriodId | PeriodStart | PeriodEnd |
---------------------------------------
| 202005 | 2020-05-01 | 2020-05-31 |
| 202006 | 2020-06-01 | 2020-06-30 |
A period won't always be from the first to the last day of the month.
Then I have an Activities table, which holds the activities each user has scheduled:
| ActivityId | UserId | ActivityStart | ActivityEnd |
-----------------------------------------------------
| 1 | A | 2020-05-20 | 2020-06-05 |
| 2 | A | 2020-06-15 | 2020-06-18 |
| 3 | B | 2020-06-10 | 2020-06-25 |
There can be gaps between the activities of a user, but the same user will never have overlapping activities.
Now I need a query that limits the activity date ranges to the start and end of the period, and fills the gaps to complete the period. I'll always filter by PeriodId, so I'll just show the example result for PeriodId = 202006:
| PeriodId | UserId | ActivityId | NewActivityStart | NewActivityEnd |
----------------------------------------------------------------------
| 202006 | A | 1 | 2020-06-01 | 2020-06-05 | --Part of ActivityId 1
| 202006 | A | NULL | 2020-06-06 | 2020-06-14 | --Fill between activities 1 and 2
| 202006 | A | 2 | 2020-06-15 | 2020-06-18 |
| 202006 | A | NULL | 2020-06-19 | 2020-06-30 | --Fill until end of period
| 202006 | B | NULL | 2020-06-01 | 2020-06-09 | --Fill from start of period
| 202006 | B | 3 | 2020-06-10 | 2020-06-25 |
| 202006 | B | NULL | 2020-06-26 | 2020-06-30 | --Fill until end of period
I've been able to contain the activity dates within the period with the following query:
SELECT p.PeriodId, a.UserId, a.ActivityId,
IIF(p.PeriodStart > a.ActivityStart, p.PeriodStart, a.ActivityStart) AS NewActivityStart,
IIF(p.PeriodEnd < a.ActivityEnd, p.PeriodEnd, a.ActivityEnd) AS NewActivityEnd
FROM Periods p
JOIN Activities a ON a.ActivityStart <= p.PeriodEnd AND a.ActivityEnd >= p.PeriodStart
But I haven't been able to fill the gaps in the ranges. I've tried using a table of consecutive dates and/or window functions like LAG/LEAD.
I feel like Window Functions could be the solution, and I've tried to follow examples about gaps/islands, but I just haven't been able to understand them well enough to make it work.
Is there a way to complete the query to fill the missing gaps? Are there other ways to achieve this in a query?
Upvotes: 2
Views: 1617
Reputation:
This is not the craziest gaps problem I've seen, but it's a good one.
DECLARE @PeriodId int = 202006;
DECLARE @ps date, @pe date;
SELECT @ps = PeriodStart, @pe = PeriodEnd FROM dbo.Periods
WHERE PeriodId = @PeriodId;
;WITH dates(rn,dt) AS
(
SELECT 1, @ps UNION ALL SELECT rn + 1, DATEADD(DAY, rn, @ps)
FROM dates WHERE dt < @pe
),
groups(UserId, dt, ActivityId, grp) AS
(
SELECT u.UserId, d.dt, r.ActivityId,
d.rn - DENSE_RANK() OVER (PARTITION BY u.UserId, r.ActivityStart ORDER BY d.dt)
FROM dates AS d CROSS JOIN (SELECT DISTINCT UserId FROM dbo.Activities
WHERE @pe >= ActivityStart AND @ps <= ActivityEnd) AS u
LEFT OUTER JOIN dbo.Activities AS r
ON u.UserId = r.UserId AND d.dt >= r.ActivityStart AND d.dt <= r.ActivityEnd
)
SELECT PeriodId = @PeriodId, UserId, ActivityId,
NewActivityStart = MIN(dt),
NewActivityEnd = MAX(dt)
FROM groups
GROUP BY UserId, ActivityId, grp
ORDER BY UserId, NewActivityStart;
If a period can be over 100 days, you need MAXRECURSION at the end of the query, because the default limit is 100 recursion levels:
OPTION (MAXRECURSION 32767);
If a period can be more than 32,767 days, change 32767 to 0, which removes the limit.
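As a rough sketch of where the hint goes, here is a standalone recursive date CTE covering a full year. The variable values are made up for illustration and are not taken from the Periods table; the point is only that OPTION comes after the final SELECT of the statement.

```sql
-- Hypothetical period of 366 days (2020 is a leap year), longer than the default limit.
DECLARE @ps date = '2020-01-01', @pe date = '2020-12-31';

WITH dates(dt) AS
(
  SELECT @ps                                 -- anchor: first day of the period
  UNION ALL
  SELECT DATEADD(DAY, 1, dt)                 -- recursive step: one row per following day
  FROM dates WHERE dt < @pe
)
SELECT COUNT(*) AS DaysInPeriod
FROM dates
OPTION (MAXRECURSION 32767);                 -- without this hint the query fails after 100 recursion levels
```

In the full query above, the hint would go after the ORDER BY clause, before the terminating semicolon.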
Updated fiddle here.
Upvotes: 2
Reputation: 1269773
I don't think this is that complicated. If you expand the periods into individual dates and do a left join, then this becomes a gaps-and-islands problem:
with dates as (
select periodid, periodstart as dte, periodend
from periods
union all
select periodid, dateadd(day, 1, dte), periodend
from dates
where dte < periodend
)
select userid, activityid, min(dte), max(dte)
from (select d.dte, d.periodid, u.userid, a.activityid,
row_number() over (partition by u.userid, a.activityid order by d.dte) as seqnum
from dates d cross join
(select distinct userid from activities) u left join
activities a
on a.userid = u.userid and
a.activitystart <= d.dte and a.activityend >= d.dte
) da
group by userid, activityid, periodid, dateadd(day, -seqnum, dte)
order by userid, min(dte);
Here is a db<>fiddle.
Note: This produces results for all users and all periods -- which seems reasonable given your description. It is pretty simple to modify to filter out users with no activity during a given period.
Also, this does not go to the end of the month. Instead, it includes the complete periods. I don't see why months would play into this -- except to confuse matters -- consider if two periods have days in the same month, for instance.
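One possible way to restrict the output to users who have some activity in the period is sketched below. It is an untested variation on the query above, not the only option: it carries periodstart through the recursive CTE and swaps the cross join for a cross apply that keeps only users with an overlapping activity. The filter on periodid = 202006 is an assumption based on the question.

```sql
with dates as (
      select periodid, periodstart as dte, periodstart, periodend
      from periods
      where periodid = 202006   -- assumption: one period at a time, as in the question
      union all
      select periodid, dateadd(day, 1, dte), periodstart, periodend
      from dates
      where dte < periodend
     )
select da.periodid, da.userid, da.activityid,
       min(da.dte) as newactivitystart, max(da.dte) as newactivityend
from (select d.dte, d.periodid, u.userid, a.activityid,
             row_number() over (partition by d.periodid, u.userid, a.activityid
                                order by d.dte) as seqnum
      from dates d cross apply
           -- only users with at least one activity overlapping the period
           (select distinct a2.userid
            from activities a2
            where a2.activitystart <= d.periodend and
                  a2.activityend >= d.periodstart
           ) u left join
           activities a
           on a.userid = u.userid and
              a.activitystart <= d.dte and a.activityend >= d.dte
     ) da
-- consecutive dates with the same activity share the same (dte - seqnum) value
group by da.periodid, da.userid, da.activityid, dateadd(day, -da.seqnum, da.dte)
order by da.userid, min(da.dte);
```

Periods longer than about 100 days would still need a MAXRECURSION hint at the end of the statement.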
Upvotes: 1
Reputation: 43636
You can solve this using various techniques. In the example below, I am writing the code as if it were the body of a SQL routine (for example, a stored procedure).
So, here is your data:
DECLARE @Periods TABLE
(
[PeriodId] INT
,[PeriodStart] DATE
,[PeriodEnd] DATE
);
INSERT INTO @Periods ([PeriodId], [PeriodStart], [PeriodEnd])
VALUES ('202005', '2020-05-01', '2020-05-31')
,('202006', '2020-06-01', '2020-06-30');
DECLARE @Activities TABLE
(
[ActivityId] INT
,[UserId] CHAR(1)
,[ActivityStart] DATE
,[ActivityEnd] DATE
);
INSERT INTO @Activities ([ActivityId], [UserId], [ActivityStart], [ActivityEnd])
VALUES (1, 'A', '2020-05-20', '2020-06-05')
,(2, 'A', '2020-06-15', '2020-06-18')
,(3, 'B', '2020-06-10', '2020-06-25');
Then, let's say we have an input parameter @PeriodID, and we use it to extract the corresponding period start and end dates:
DECLARE @PeriodID INT
,@PeriodDateStart DATE
,@PeriodDateEnd DATE;
SET @PeriodID = 202006;
SELECT @PeriodDateStart = [PeriodStart]
,@PeriodDateEnd = [PeriodEnd]
FROM @Periods
WHERE [PeriodId] = @PeriodID;
Then, let's create a buffer table in which we will calculate the matches between the activity and the period tables, and add filler records at the start and end of the period if needed:
DECLARE @Buffer TABLE
(
[ActivityId] INT
,[UserId] CHAR(1)
,[ActivityStart] DATE
,[ActivityEnd] DATE
);
WITH DataSource AS
(
SELECT A.[ActivityId]
,A.[UserId]
,A.[ActivityStart]
,A.[ActivityEnd]
FROM @Activities A
INNER JOIN @Periods P
ON A.[ActivityStart] <= P.[PeriodEnd]
AND A.[ActivityEnd] >= P.[PeriodStart]
WHERE P.PeriodId = @PeriodID
)
INSERT INTO @Buffer ([ActivityId], [UserId], [ActivityStart], [ActivityEnd])
SELECT [ActivityId]
,[UserId]
,IIF([ActivityStart] < @PeriodDateStart, @PeriodDateStart, [ActivityStart]) AS [ActivityStart]
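,IIF([ActivityEnd] > @PeriodDateEnd, @PeriodDateEnd, [ActivityEnd]) AS [ActivityEnd] -- clamp to the period end as well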
FROM DataSource
UNION ALL
SELECT NULL
,[UserId]
,DATEADD(DAY, 1, MAX([ActivityEnd]))
,@PeriodDateEnd
FROM DataSource
GROUP BY [UserId]
HAVING DATEADD(DAY, 1, MAX([ActivityEnd])) <= @PeriodDateEnd -- <= keeps a one-day filler at the end of the period
UNION ALL
SELECT NULL
,[UserId]
,@PeriodDateStart
,DATEADD(DAY, -1, MIN([ActivityStart]))
FROM DataSource
GROUP BY [UserId]
HAVING DATEADD(DAY, -1, MIN([ActivityStart])) >= @PeriodDateStart; -- >= keeps a one-day filler at the start of the period
It's simple. The common table expression reuses your code, and then we simply check whether we need to add a filler record before and/or after a user's activities within the period.
Now we are ready to calculate the gaps. There are many ways to do this; I am using the LEAD function to calculate the missing period that follows each row. The statement is below:
SELECT *
,DATEADD(DAY, 1, [ActivityEnd]) AS [MissingPeriodStart]
,DATEADD(DAY, -1, LEAD([ActivityStart]) OVER (PARTITION BY [UserID] ORDER BY [ActivityStart] ASC)) AS [MissingPeriodEnd]
FROM @Buffer
ORDER BY USERID, ActivityStart;
The output is like this:
As you can see, we have generated missing-period dates for each row except the last one. Now we need to keep only some of these missing periods, like this:
WITH DataSource AS
(
SELECT *
,DATEADD(DAY, 1, [ActivityEnd]) AS [MissingPeriodStart]
,DATEADD(DAY, -1, LEAD([ActivityStart]) OVER (PARTITION BY [UserID] ORDER BY [ActivityStart] ASC)) AS [MissingPeriodEnd]
FROM @Buffer
)
SELECT @PeriodID AS [PeriodID]
,[UserId]
,[ActivityId]
,[ActivityStart]
,[ActivityEnd]
FROM DataSource
UNION ALL
SELECT @PeriodID AS [PeriodID]
,[UserId]
,NULL
,[MissingPeriodStart]
,[MissingPeriodEnd]
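FROM DataSource M
WHERE NOT EXISTS
(
SELECT 1
FROM DataSource DS
WHERE M.[MissingPeriodStart] = DS.[ActivityStart] -- qualify with M so the subquery is correlated to the outer row
AND M.[UserId] = DS.[UserId]
)
AND M.[MissingPeriodStart] <= M.[MissingPeriodEnd] -- <= keeps single-day gaps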
ORDER BY [UserId]
,[ActivityStart];
and the result is:
Of course, this is just an idea. You may need to change or tune it before using it with your real data. I hope it gives you a start.
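To see the LEAD step in isolation, here is a small standalone sketch. The three inline rows are an assumption: they mimic what the buffer would hold for user A in period 202006 (two clamped activities plus the filler at the end of the period), so no table variables are needed.

```sql
SELECT b.UserId, b.ActivityStart, b.ActivityEnd
      ,DATEADD(DAY, 1, b.ActivityEnd) AS MissingPeriodStart
      ,DATEADD(DAY, -1, LEAD(b.ActivityStart) OVER (PARTITION BY b.UserId ORDER BY b.ActivityStart)) AS MissingPeriodEnd
FROM (VALUES ('A', CAST('2020-06-01' AS DATE), CAST('2020-06-05' AS DATE))
            ,('A', '2020-06-15', '2020-06-18')
            ,('A', '2020-06-19', '2020-06-30')) AS b (UserId, ActivityStart, ActivityEnd)
ORDER BY b.UserId, b.ActivityStart;
-- Row 1: missing period 2020-06-06 to 2020-06-14 (a real gap)
-- Row 2: missing period 2020-06-19 to 2020-06-18 (start after end, so no gap)
-- Row 3: missing period start 2020-07-01, end NULL (no following row)
```

Only the first row describes a real gap, which is why the final query filters on the missing period's start and end before adding those rows.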
Upvotes: 3