Reputation: 97
I got a search sessions log that looks like this:
+----------+-------------------------+----------+
| dt | search_time | searches |
+----------+-------------------------+----------+
| 20200601 | 2020-06-01 00:36:38.000 | 1 |
| 20200601 | 2020-06-01 00:37:38.000 | 1 |
| 20200601 | 2020-06-01 00:39:18.000 | 1 |
| 20200601 | 2020-06-01 01:16:18.000 | 1 |
| 20200601 | 2020-06-01 03:56:38.000 | 1 |
| 20200601 | 2020-06-01 05:36:38.000 | 1 |
| 20200601 | 2020-06-01 05:37:38.000 | 1 |
| 20200601 | 2020-06-01 05:39:38.000 | 1 |
| 20200601 | 2020-06-01 05:41:38.000 | 1 |
| 20200601 | 2020-06-01 07:26:38.000 | 1 |
+----------+-------------------------+----------+
My task is to partition each row into session groups. Session groups are up to five minutes.
For example:
Those TOP 3 sessions will form a group session 1 - if we accumulate the minutes between each row, we will get 3 minutes and the 4th would accumulate to more then 5 minutes so it will be a different session group.
+----------+-------------------------+----------+---------------+
| dt | search_time | searches | group_session |
+----------+-------------------------+----------+---------------+
| 20200601 | 2020-06-01 00:36:38.000 | 1 | 1 |
| 20200601 | 2020-06-01 00:37:38.000 | 1 | 1 |
| 20200601 | 2020-06-01 00:39:18.000 | 1 | 1 |
| 20200601 | 2020-06-01 01:16:18.000 | 1 | 2 |
+----------+-------------------------+----------+---------------+
I manipulated the table like this in order to get it ready for partitioning:
WITH [Sub Table] AS
(
SELECT [dt]
,[search_time]
,[pervious search time] = LAG(search_time) OVER (ORDER BY search_time)
,[min diff] = ISNULL(DATEDIFF(MINUTE,LAG(search_time) OVER (ORDER BY search_time),search_time),0)
,[searches]
FROM [search_session]
)
SELECT
[dt],
[search_time],
[pervious search time],
[min diff],
[searches]
FROM [Sub Table]
And got this:
+----------+-------------------------+-------------------------+----------+----------+
| dt | search_time | pervious search time | min diff | searches |
+----------+-------------------------+-------------------------+----------+----------+
| 20200601 | 2020-06-01 00:36:38.000 | NULL | 0 | 1 |
| 20200601 | 2020-06-01 00:37:38.000 | 2020-06-01 00:36:38.000 | 1 | 1 |
| 20200601 | 2020-06-01 00:39:18.000 | 2020-06-01 00:37:38.000 | 2 | 1 |
| 20200601 | 2020-06-01 01:16:18.000 | 2020-06-01 00:39:18.000 | 37 | 1 |
| 20200601 | 2020-06-01 03:56:38.000 | 2020-06-01 01:16:18.000 | 160 | 1 |
| 20200601 | 2020-06-01 05:36:38.000 | 2020-06-01 03:56:38.000 | 100 | 1 |
| 20200601 | 2020-06-01 05:37:38.000 | 2020-06-01 05:36:38.000 | 1 | 1 |
| 20200601 | 2020-06-01 05:39:38.000 | 2020-06-01 05:37:38.000 | 2 | 1 |
| 20200601 | 2020-06-01 05:41:38.000 | 2020-06-01 05:39:38.000 | 2 | 1 |
| 20200601 | 2020-06-01 07:26:38.000 | 2020-06-01 05:41:38.000 | 105 | 1 |
+----------+-------------------------+-------------------------+----------+----------+
I thought about two possibilities to continue:
Using a window function, like RANK(), I can partition the rows, but I can't figure out how to set the PARTITION BY caluse with a condition to do so.
To iterate the table with a WHILE loop - again finding hard time to form ths
Upvotes: 1
Views: 268
Reputation: 222582
This cannot be done with just window functions. You need some kind of iterative process, that keeps track of the first row of each group, and dynamically identififes the next one.
In SQL, you can phrase this with a recursive query:
with
data as (select t.*, row_number() over(order by search_time) rn from mytable t),
cte as (
select d.*, search_time as first_search_time
from data d
where rn = 1
union all
select d.*,
case when d.search_time > dateadd(minute, 5, c.first_search_time)
then d.search_time
else c.first_search_time
end
from cte c
inner join data d on d.rn = c.rn + 1
)
select c.*, dense_rank() over(order by first_search_time) grp
from cte c
For your sample data, this returns:
dt | search_time | searches | rn | first_search_time | grp :--------- | :---------------------- | -------: | -: | :---------------------- | --: 2020-06-01 | 2020-06-01 00:36:38.000 | 1 | 1 | 2020-06-01 00:36:38.000 | 1 2020-06-01 | 2020-06-01 00:37:38.000 | 1 | 2 | 2020-06-01 00:36:38.000 | 1 2020-06-01 | 2020-06-01 00:39:18.000 | 1 | 3 | 2020-06-01 00:36:38.000 | 1 2020-06-01 | 2020-06-01 01:16:18.000 | 1 | 4 | 2020-06-01 01:16:18.000 | 2 2020-06-01 | 2020-06-01 03:56:38.000 | 1 | 5 | 2020-06-01 03:56:38.000 | 3 2020-06-01 | 2020-06-01 05:36:38.000 | 1 | 6 | 2020-06-01 05:36:38.000 | 4 2020-06-01 | 2020-06-01 05:37:38.000 | 1 | 7 | 2020-06-01 05:36:38.000 | 4 2020-06-01 | 2020-06-01 05:39:38.000 | 1 | 8 | 2020-06-01 05:36:38.000 | 4 2020-06-01 | 2020-06-01 05:41:38.000 | 1 | 9 | 2020-06-01 05:36:38.000 | 4 2020-06-01 | 2020-06-01 07:26:38.000 | 1 | 10 | 2020-06-01 07:26:38.000 | 5
Upvotes: 1