Reputation: 23
Suppose I have this table:
CREATE TABLE #t1
(
PersonID int ,
ExamDates date,
Score varchar(50) SPARSE NULL,
);
SET dateformat mdy;
INSERT INTO #t1 (PersonID, ExamDates, Score)
VALUES (1, '1.1.2018',70),
(1, '1.13.2018', 100),
(1, '1.18.2018', 85),
(2, '1.1.2018', 90),
(2, '2.1.2018', 95),
(2, '3.15.2018', 95),
(2, '7.30.2018', 100),
(3, '1.1.2018', 80),
(3, '1.2.2018', 80),
(3, '5.3.2018', 50),
(4, '2.1.2018', 90),
(4, '2.20.2018', 100);
I would like to find observations that occurs at least 3 times spanning at least 15 days but no more than 90 days for each unique ID. My final table should look like this:
PersonID | ExamDates | Score |
---|---|---|
1 | 1/1/2018 | 70 |
1 | 1/13/2018 | 100 |
1 | 1/18/2018 | 85 |
2 | 1/1/2018 | 90 |
2 | 2/1/2018 | 95 |
2 | 3/15/2018 | 95 |
We have code working for this using R, but would like to avoid pulling large datasets into R just to run this code. We are doing this in a very large dataset and concerned about efficiency of the query.
Thanks! -Peter
Upvotes: 2
Views: 156
Reputation: 2615
Here's Eli's idea done a bit more simply, and moving all of the heavy computation to the cte, where it may possibly be more efficient:
With cte As (
Select PersonID, ExamDates
,Case When Datediff(DAY,ExamDates, Lead(ExamDates,2,Null) Over (Partition by PersonID Order by ExamDates)) Between 15 and 90
Then Lead(ExamDates,2,Null) Over (Partition by PersonID Order by ExamDates)
Else NULL End as EndDateRange
From #t1
)
Select Distinct B.*
From cte Inner Join #t1 B On B.PersonID=cte.PersonID
And B.ExamDates Between cte.ExamDates and cte.EndDateRange
The Case statement in the CTE only returns a valid date if the entry two items later satisfies the overall condition; that date is used to form a range with the current record's ExamDate. By returning NULL on non-qualified ranges we ensure the join in the outer part of the SQL is not satisfied. The Distinct clause is needed to collapse duplicates when there are are 4+ consecutive observations within the 15-90 day range.
Upvotes: 0
Reputation: 416111
To start with, the common name for this situation is Gaps and Islands. That will help you as you search for answers or come up with similar problems in the future.
That out of the way, here is my solution. Start with this:
WITH Leads As (
SELECT t1.*
, datediff(day, ExamDates, lead(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates)) As Diff
FROM t1
)
SELECT *
FROM Leads
WHERE Diff BETWEEN 15 AND 90
I have to use the CTE, because you can't put a windowing function in a WHERE clause. It produces this result, which is only part of what you want:
PersonID | ExamDates | Score | Diff |
---|---|---|---|
1 | 2018-01-01 | 70 | 17 |
2 | 2018-01-01 | 90 | 73 |
This shows the first record in each group. We can use it to join back to the original table and find all the records that meet the requirements.
But first, we have a problem. The sample data only has groups with exactly three records. However, the real data might end up with groups with more than three items. In that case this would find multiple first records from the same group.
You can see it in this updated SQL Fiddle, which adds an additional record for PersonID #1 that is still inside the date range.
PersonID | ExamDates | Score | Diff |
---|---|---|---|
1 | 2018-01-01 | 70 | 17 |
1 | 2018-01-13 | 100 | 29 |
2 | 2018-01-01 | 90 | 73 |
I'll be using this additional record in every step from now on.
To account for this, we also need to check to see each record is not in the middle or end of a valid group. That is, also look a couple records both ahead and behind.
WITH Diffs As (
SELECT #t1.*
, datediff(day, ExamDates, lead(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates)) As LeadDiff2
, datediff(day, ExamDates, lead(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates)) As LeadDiff1
, datediff(day, lag(ExamDates, 1, NULL) over (partition by PersonID ORDER BY ExamDates), ExamDates) as LagDiff1
, datediff(day, lag(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates), ExamDates) as LagDiff2
FROM #t1
)
SELECT *
FROM Diffs
WHERE LeadDiff2 BETWEEN 15 AND 90
AND coalesce(LeadDiff1 + LagDiff1,100) > 90 /* Not in the middle of a valid group */
AND coalesce(Lagdiff2, 100) > 90 /* Not at the end of a valid group */
This code gets us back to the original results, even with the additional record. Here's the updated fiddle:
Now we can join back to the original table and find all records in each group:
WITH Diffs As (
SELECT 3t1.*
, datediff(day, ExamDates, lead(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates)) As LeadDiff2
, datediff(day, ExamDates, lead(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates)) As LeadDiff1
, datediff(day, lag(ExamDates, 1, NULL) over (partition by PersonID ORDER BY ExamDates), ExamDates) as LagDiff1
, datediff(day, lag(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates), ExamDates) as LagDiff2
FROM #t1
), FirstRecords AS (
SELECT PersonID, ExamDates, DATEADD(day, 90, ExamDates) AS FinalDate
FROM Diffs
WHERE LeadDiff2 BETWEEN 15 AND 90
AND coalesce(LeadDiff1 + LagDiff1,100) > 90 /* Not in the middle of a valid group */
AND coalesce(lagdiff2, 100) > 90 /* Not at the end of a valid group */
)
SELECT t.*
FROM FirstRecords f
INNER JOIN #t1 t ON t.PersonID = f.PersonID
AND t.ExamDates >= f.ExamDates
AND t.ExamDates <= f.FinalDate
ORDER BY t.PersonID, t.ExamDates
That gives me this, which matches your desired output and my extra record:
PersonID | ExamDates | Score |
---|---|---|
1 | 2018-01-01 | 70 |
1 | 2018-01-13 | 100 |
1 | 2018-01-18 | 85 |
1 | 2018-02-11 | 89 |
2 | 2018-01-01 | 90 |
2 | 2018-02-01 | 95 |
2 | 2018-03-15 | 95 |
See it work here:
Upvotes: 2
Reputation: 2608
You'll need a CTE to identify the base for the conditions which you described.
This code works with your sample set, and should work even when you have a larger set - though may require a distinct if you have overlapping results, i.e. 5 exam dates in the 15-90 range.
WITH cte AS(
SELECT
PERSONID
,EXAMDATES
,Score
,COUNT(*) OVER (PARTITION BY PERSONID ORDER BY ExamDates ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )AS COUNTS
,LAG(ExamDates,2,NULL) OVER (PARTITION BY PERSONID ORDER BY ExamDates) DIFFS
FROM #t1
)
SELECT B.*
FROM CTE
INNER JOIN #T1 B ON CTE.PERSONID = B.PERSONID
WHERE CTE.COUNTS >=3
AND DATEDIFF(DAY,CTE.DIFFS,CTE.EXAMDATES) BETWEEN 15 AND 90
AND B.EXAMDATES BETWEEN CTE.DIFFS AND CTE.EXAMDATES
Upvotes: 0