Merik
Merik

Reputation: 2827

User sessions based on user activity

Preface: Sorry for the title, it doesn't really describe the question. Please feel free to edit.

Data that shows activities of a user:

actionid userid subjectid dtm
1        1      1         2016-01-04 08:00:00.000
2        1      1         2016-01-04 08:10:00.000
3        1      1         2016-01-04 08:20:00.000
4        1      1         2016-01-04 10:10:00.000
5        1      1         2016-01-04 10:25:00.000
11       1      2         2016-01-05 08:00:00.000
12       1      2         2016-01-25 08:10:00.000
13       1      2         2016-01-05 08:20:00.000
14       1      2         2016-01-05 10:20:00.000
15       1      2         2016-01-05 10:25:00.000
16       2      1         2016-01-05 12:00:00.000
17       2      1         2016-01-05 12:00:00.000
18       2      3         2016-01-05 12:04:00.000
19       2      3         2016-01-05 12:05:00.000
20       2      1         2016-01-05 12:07:30.000

A "session" is defined as a collection of user activities that are done in the same sitting. A session starts with a user working on a particular subject, and ends when either (a) the user switches to a different subject, or (b) 60 minutes pass.

The objective is to identify the sessions and assign them session IDs, all in SQL. Standard SQL is preferred, but since this data is on a Microsoft SQL Server, T-SQL features (like CROSS APPLY, LEAD, LAG, etc) are also acceptable.

The desired outcome would be something like:

actionid userid subjectid dtm                     sessionid
1        1      1         2016-01-04 08:00:00.000 1
2        1      1         2016-01-04 08:10:00.000 1
3        1      1         2016-01-04 08:20:00.000 1
4        1      1         2016-01-04 10:10:00.000 2
5        1      1         2016-01-04 10:25:00.000 2
11       1      2         2016-01-05 08:00:00.000 3
12       1      2         2016-01-25 08:10:00.000 3
13       1      2         2016-01-05 08:20:00.000 3
14       1      2         2016-01-05 10:20:00.000 4
15       1      2         2016-01-05 10:25:00.000 4
16       2      1         2016-01-05 12:00:00.000 5
17       2      1         2016-01-05 12:00:00.000 5
18       2      3         2016-01-05 12:04:00.000 6
19       2      3         2016-01-05 12:05:00.000 6
20       2      1         2016-01-05 12:07:30.000 7

I will explain the logic again for clarity, using this example:

Action #4 is not going to belong in Session #1, because it is more than 1 hour since the first action in that session. Session #2 ends when the user switches to subject 2. Action #16 and #17 have the same time, and that is fine. Action #20 is going to be a new session, even though it is within an hour of action #17, because the user switched subjects in between #18 and #20.

Upvotes: 0

Views: 195

Answers (1)

Ross Bush
Ross Bush

Reputation: 15185

If using SQL Server 2012 + then you definitely would want to make use of LAG(). Here is a comparison of doing it the old way versus using the LAG() windowing function.

Old Way

SELECT
    this.actionid,
    this.userid,
    this.subjectid,
    this.dtm,
    UserSubjectBoundryMarker=CASE WHEN prev.subjectid<>this.subjectid OR prev.subjectID IS NULL THEN 1 ELSE 0 END   
FROM
(
    SELECT
        *,
        ID=ROW_NUMBER() OVER(PARTITION BY 1 order by userid,actionid)
    FROM
        mytable
    )AS this
    LEFT OUTER JOIN 
    (
        SELECT
            *,
            ID=ROW_NUMBER() OVER(PARTITION BY 1 order by userid,actionid)
        FROM
            mytable 
    )
    Prev ON (Prev.ID=this.ID-1) OR(Prev.ID IS NULL)
ORDER BY
    userid,dtm

Using LAG()

SELECT
    *,
    UserSubjectBoundryMarker=CASE WHEN LAG(subjectid, 1,0) OVER (ORDER BY userid,actionid) <> this.subjectID THEN 1 ELSE 0 END
FROM
    mytable this

Solution

1. Mark dynamic subject group boundaries with a marker.
2. Apply a dynamic group id to changes using max(change=1) + max(actionid)
3. Mark dynamic hour group boundaries to the result of 2 with a marker.
4. Apply a dynamic group id to the changes using max(change=1) + max(result of #2).
5. Apply a dense rank to your outer group from #4 in a sequential order.

SELECT
    e.actionid,
    e.userid,
    e.subjectid,
    e.dtm,
    sessionid=DENSE_RANK() OVER(PARTITION BY  1 ORDER BY userid,subjectid,SubjectChangeGroupID,SubjectHourChangeBoundry,SubjectHourChangeGroupID  )
FROM
(
    SELECT
        *,
        SubjectHourChangeGroupID=
            MAX(SubjectHourChangeBoundry) OVER (PARTITION BY userid,subjectid,SubjectChangeGroupID,SubjectHourChangeBoundry) 
            +
            MAX(SubjectChangeGroupID) OVER (PARTITION BY userid,subjectid,SubjectChangeGroupID,SubjectHourChangeBoundry)    
    FROM
    (

        SELECT
            *,
            SubjectHourChangeBoundry=CASE WHEN DATEDIFF(HOUR,MIN(C.dtm) OVER (PARTITION BY SubjectChangeGroupID),C.dtm) > 1 THEN 1 ELSE 0 END
        FROM
        (
            SELECT 
                *,
                SubjectChangeGroupID=
                    MAX(UserSubjectBoundryMarker) OVER (PARTITION BY userid,subjectid) 
                    +
                    MAX(actionid) OVER (PARTITION BY userid,subjectid)
            FROM
            (
                SELECT *,UserSubjectBoundryMarker=CASE WHEN LAG(a.subjectid, 1,0) OVER (ORDER BY a.userid,a.actionid) <> a.subjectID THEN 1 ELSE 0 END
                FROM
                    mytable A
            )AS B
        )AS C
    )AS D   
)AS E

Upvotes: 1

Related Questions