Reputation: 2827
Preface: Sorry for the title, it doesn't really describe the question. Please feel free to edit.
Data that shows activities of a user:
actionid userid subjectid dtm
1 1 1 2016-01-04 08:00:00.000
2 1 1 2016-01-04 08:10:00.000
3 1 1 2016-01-04 08:20:00.000
4 1 1 2016-01-04 10:10:00.000
5 1 1 2016-01-04 10:25:00.000
11 1 2 2016-01-05 08:00:00.000
12 1 2 2016-01-25 08:10:00.000
13 1 2 2016-01-05 08:20:00.000
14 1 2 2016-01-05 10:20:00.000
15 1 2 2016-01-05 10:25:00.000
16 2 1 2016-01-05 12:00:00.000
17 2 1 2016-01-05 12:00:00.000
18 2 3 2016-01-05 12:04:00.000
19 2 3 2016-01-05 12:05:00.000
20 2 1 2016-01-05 12:07:30.000
A "session" is defined as a collection of user activities that are done in the same sitting. A session starts with a user working on a particular subject, and ends when either (a) the user switches to a different subject, or (b) 60 minutes pass.
The objective is to identify the sessions and assign them session IDs, all in SQL. Standard SQL is preferred, but since this data is on a Microsoft SQL Server, T-SQL features (like CROSS APPLY, LEAD, LAG, etc) are also acceptable.
The desired outcome would be something like:
actionid userid subjectid dtm sessionid
1 1 1 2016-01-04 08:00:00.000 1
2 1 1 2016-01-04 08:10:00.000 1
3 1 1 2016-01-04 08:20:00.000 1
4 1 1 2016-01-04 10:10:00.000 2
5 1 1 2016-01-04 10:25:00.000 2
11 1 2 2016-01-05 08:00:00.000 3
12 1 2 2016-01-25 08:10:00.000 3
13 1 2 2016-01-05 08:20:00.000 3
14 1 2 2016-01-05 10:20:00.000 4
15 1 2 2016-01-05 10:25:00.000 4
16 2 1 2016-01-05 12:00:00.000 5
17 2 1 2016-01-05 12:00:00.000 5
18 2 3 2016-01-05 12:04:00.000 6
19 2 3 2016-01-05 12:05:00.000 6
20 2 1 2016-01-05 12:07:30.000 7
I will explain the logic again for clarity, using this example:
Action #4 is not going to belong in Session #1, because it is more than 1 hour since the first action in that session. Session #2 ends when the user switches to subject 2. Action #16 and #17 have the same time, and that is fine. Action #20 is going to be a new session, even though it is within an hour of action #17, because the user switched subjects in between #18 and #20.
Upvotes: 0
Views: 195
Reputation: 15185
If using SQL Server 2012 + then you definitely would want to make use of LAG()
. Here is a comparison of doing it the old way versus using the LAG()
windowing function.
Old Way
SELECT
this.actionid,
this.userid,
this.subjectid,
this.dtm,
UserSubjectBoundryMarker=CASE WHEN prev.subjectid<>this.subjectid OR prev.subjectID IS NULL THEN 1 ELSE 0 END
FROM
(
SELECT
*,
ID=ROW_NUMBER() OVER(PARTITION BY 1 order by userid,actionid)
FROM
mytable
)AS this
LEFT OUTER JOIN
(
SELECT
*,
ID=ROW_NUMBER() OVER(PARTITION BY 1 order by userid,actionid)
FROM
mytable
)
Prev ON (Prev.ID=this.ID-1) OR(Prev.ID IS NULL)
ORDER BY
userid,dtm
Using LAG()
SELECT
*,
UserSubjectBoundryMarker=CASE WHEN LAG(subjectid, 1,0) OVER (ORDER BY userid,actionid) <> this.subjectID THEN 1 ELSE 0 END
FROM
mytable this
Solution
1. Mark dynamic subject group boundaries with a marker.
2. Apply a dynamic group id to changes using max(change=1) + max(actionid)
3. Mark dynamic hour group boundaries to the result of 2 with a marker.
4. Apply a dynamic group id to the changes using max(change=1) + max(result of #2).
5. Apply a dense rank to your outer group from #4 in a sequential order.
SELECT
e.actionid,
e.userid,
e.subjectid,
e.dtm,
sessionid=DENSE_RANK() OVER(PARTITION BY 1 ORDER BY userid,subjectid,SubjectChangeGroupID,SubjectHourChangeBoundry,SubjectHourChangeGroupID )
FROM
(
SELECT
*,
SubjectHourChangeGroupID=
MAX(SubjectHourChangeBoundry) OVER (PARTITION BY userid,subjectid,SubjectChangeGroupID,SubjectHourChangeBoundry)
+
MAX(SubjectChangeGroupID) OVER (PARTITION BY userid,subjectid,SubjectChangeGroupID,SubjectHourChangeBoundry)
FROM
(
SELECT
*,
SubjectHourChangeBoundry=CASE WHEN DATEDIFF(HOUR,MIN(C.dtm) OVER (PARTITION BY SubjectChangeGroupID),C.dtm) > 1 THEN 1 ELSE 0 END
FROM
(
SELECT
*,
SubjectChangeGroupID=
MAX(UserSubjectBoundryMarker) OVER (PARTITION BY userid,subjectid)
+
MAX(actionid) OVER (PARTITION BY userid,subjectid)
FROM
(
SELECT *,UserSubjectBoundryMarker=CASE WHEN LAG(a.subjectid, 1,0) OVER (ORDER BY a.userid,a.actionid) <> a.subjectID THEN 1 ELSE 0 END
FROM
mytable A
)AS B
)AS C
)AS D
)AS E
Upvotes: 1