Reputation: 33
Sorry if my title doesn't properly describe the task I'm trying to perform.
For a university project, I have received the access logs of a website. I have discarded the unneeded columns and condensed the data down to this:
╔══════════╦══════════════════════╦═════════════════╦═════════════╦════════════════╗
║ accessid ║ date_time_in_seconds ║ yg_requester_id ║ referent_id ║ referent_docid ║
╠══════════╬══════════════════════╬═════════════════╬═════════════╬════════════════╣
║ 2449 ║ 2009011621830 ║ 32276 ║ 12648 ║ 1 ║
║ 2776 ║ 2009011622726 ║ 76360 ║ 11070 ║ 1 ║
║ 2804 ║ 2009011622783 ║ 32276 ║ 13845 ║ 1 ║
║ 2894 ║ 2009011623025 ║ 32276 ║ 7222 ║ 1 ║
║ 2895 ║ 2009011623037 ║ 32276 ║ 1530 ║ 1 ║
║ 3000 ║ 2009011623406 ║ 32276 ║ 3728 ║ 1 ║
║ 3019 ║ 2009011623497 ║ 520060 ║ 10356 ║ 1 ║
║ 3245 ║ 2009011625780 ║ 300841 ║ 4607 ║ 1 ║
║ 3274 ║ 2009011628309 ║ 532664 ║ 14377 ║ 1 ║
║ 3275 ║ 2009011628420 ║ 532664 ║ 9097 ║ 1 ║
╚══════════╩══════════════════════╩═════════════════╩═════════════╩════════════════╝
Originally the time and date stamps were in separate columns per unit of measurement (year, month, day, hour, minute, second). To make calculation easier, I have consolidated them into date_time_in_seconds, which has the format
[0000][00][00][00000]
[YEAR][MONTH][DAY][Number of Seconds since 00:00]
accessid is the table entry ID, yg_requester_id is the unique ID of the website visitor, and referent_id is the ID of the website article they read. referent_docid denotes the type of article, but it is not needed for this task.
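A small Python sketch of how the packed format described above could be decoded into a real timestamp. The function name `decode_packed` is made up for illustration, and it assumes the last five digits are always seconds since midnight. It also shows why subtracting the packed integers directly only gives the right answer when both accesses fall on the same day.

```python
from datetime import datetime, timedelta

def decode_packed(packed: int) -> datetime:
    """Decode [YEAR][MONTH][DAY][seconds since 00:00] into a datetime."""
    date_part, seconds = divmod(packed, 100_000)   # split off the 5-digit seconds field
    year, month_day = divmod(date_part, 10_000)    # split YYYY from MMDD
    month, day = divmod(month_day, 100)            # split MM from DD
    return datetime(year, month, day) + timedelta(seconds=seconds)

# Same-day example taken from the table: both approaches agree.
first = decode_packed(2009011621830)
second = decode_packed(2009011622783)
elapsed = (second - first).total_seconds()
raw_same_day = 2009011622783 - 2009011621830

# Hypothetical pair that crosses midnight: the raw subtraction is wrong.
before_midnight = 2009011686399   # 23:59:59 on 16 Jan 2009
after_midnight = 2009011700001    # 00:00:01 on 17 Jan 2009
true_gap = (decode_packed(after_midnight) - decode_packed(before_midnight)).total_seconds()
raw_gap = after_midnight - before_midnight

print(first, second, elapsed, raw_same_day)
print(true_gap, raw_gap)
```

Converting to a proper date/time type before taking differences avoids the midnight problem entirely.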
Basically, I would like to be able to find the time difference since the last different referent_id was accessed by the same yg_requester_id. So for instance, looking at this section of rows from the above table:
╔══════════╦══════════════════════╦═════════════════╦═════════════╦════════════════╗
║ accessid ║ date_time_in_seconds ║ yg_requester_id ║ referent_id ║ referent_docid ║
╠══════════╬══════════════════════╬═════════════════╬═════════════╬════════════════╣
║ 2449 ║ 2009011621830 ║ 32276 ║ 12648 ║ 1 ║
║ 2776 ║ 2009011622726 ║ 76360 ║ 11070 ║ 1 ║
║ 2804 ║ 2009011622783 ║ 32276 ║ 13845 ║ 1 ║
╚══════════╩══════════════════════╩═════════════════╩═════════════╩════════════════╝
yg_requester_id 32276 accessed the article with id 12648 at 06:03:50 (21830 seconds after midnight) on the 16th of January 2009. They then accessed the article with id 13845 at 06:19:43 (22783 seconds after midnight) on the 16th of January 2009. So it is safe to assume that the user read the first article (id 12648) for about 15 minutes and 53 seconds (22783 - 21830 = 953 seconds).
What I would like to find is that time difference between the articles accessed by the same user. Consecutive articles read by a user may not have consecutive accessid values (although accessid always increases). I would also like to cap the time read at about an hour, as the task is to filter out records where the time read is under a variable number of minutes (15, for instance).
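To make the requirement concrete, here is a rough Python sketch of the calculation on the sample rows. The names `reading_times`, `min_seconds` and `max_seconds` are invented for illustration, and it assumes "limit to about an hour" means durations above the cap are dropped rather than truncated. Repeated hits on the same article are collapsed so that the difference is always taken to the next different referent_id.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (accessid, date_time_in_seconds, yg_requester_id, referent_id) from the question's table
ROWS = [
    (2449, 2009011621830, 32276, 12648), (2776, 2009011622726, 76360, 11070),
    (2804, 2009011622783, 32276, 13845), (2894, 2009011623025, 32276, 7222),
    (2895, 2009011623037, 32276, 1530),  (3000, 2009011623406, 32276, 3728),
    (3019, 2009011623497, 520060, 10356), (3245, 2009011625780, 300841, 4607),
    (3274, 2009011628309, 532664, 14377), (3275, 2009011628420, 532664, 9097),
]

def decode_packed(packed):
    """Decode [YEAR][MONTH][DAY][seconds since 00:00] into a datetime."""
    date_part, seconds = divmod(packed, 100_000)
    year, month_day = divmod(date_part, 10_000)
    month, day = divmod(month_day, 100)
    return datetime(year, month, day) + timedelta(seconds=seconds)

def reading_times(rows, min_seconds=15 * 60, max_seconds=60 * 60):
    """Return (requester, referent, seconds) where min_seconds <= seconds <= max_seconds."""
    by_user = defaultdict(list)
    for accessid, packed, requester, referent in rows:
        by_user[requester].append((decode_packed(packed), accessid, referent))

    result = []
    for requester, visits in by_user.items():
        visits.sort()  # chronological order per user (accessid breaks ties)
        # Collapse consecutive hits on the same article into one run.
        runs = []
        for when, _, referent in visits:
            if not runs or runs[-1][1] != referent:
                runs.append((when, referent))
        # Time on an article = start of the next different article - start of this one.
        for (start, referent), (end, _) in zip(runs, runs[1:]):
            seconds = int((end - start).total_seconds())
            if min_seconds <= seconds <= max_seconds:
                result.append((requester, referent, seconds))
    return result

print(reading_times(ROWS))                  # 15-minute minimum, 1-hour cap
print(reading_times(ROWS, min_seconds=0))   # every consecutive pair
```

With the default 15-minute threshold only the 953-second read of article 12648 by user 32276 survives. The last article each user opens has no following access, so no duration can be derived for it.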
Thanks in advance. Let me know if any more information is needed.
Upvotes: 3
Views: 1001
Reputation: 1059
This query should return the requester, the referent, and the time in seconds that the requester spent on that referent:
-- abc: every pair of accesses by the same requester where a is later than b
--      and the two accesses are to different articles.
-- xyz: for each earlier access b, the earliest such later access a.
-- Note: subtracting date_time_in_seconds values is only valid when both
--       accesses fall on the same day.
select abc.A_requestor as requestor_id,
       abc.B_refer as referent_id,
       abc.A_datetime - abc.B_datetime as time_difference
from
    (select a.accessid as A_accessid, b.accessid as B_accessid,
            a.yg_requester_id as A_requestor, a.date_time_in_seconds as A_datetime, a.referent_id as A_refer,
            b.yg_requester_id as B_requestor, b.date_time_in_seconds as B_datetime, b.referent_id as B_refer
     from weblog a
     inner join weblog b
        on a.yg_requester_id = b.yg_requester_id
       and a.date_time_in_seconds > b.date_time_in_seconds
       and a.referent_id != b.referent_id) abc
inner join
    (select cte.B_accessid, min(cte.A_accessid) as C_accessid
     from
        (select a.accessid as A_accessid, b.accessid as B_accessid,
                a.yg_requester_id as A_requestor, a.date_time_in_seconds as A_datetime, a.referent_id as A_refer,
                b.yg_requester_id as B_requestor, b.date_time_in_seconds as B_datetime, b.referent_id as B_refer
         from weblog a
         inner join weblog b
            on a.yg_requester_id = b.yg_requester_id
           and a.date_time_in_seconds > b.date_time_in_seconds
           and a.referent_id != b.referent_id) cte
     group by cte.B_accessid) xyz
    on xyz.B_accessid = abc.B_accessid and xyz.C_accessid = abc.A_accessid
Upvotes: 0
Reputation: 1120
I would use ROW_NUMBER, partitioning the result set by yg_requester_id and ordering it by either accessid or datetime (supposing you are going to change your date_time_in_seconds column into a regular datetime column, as suggested in the comments). Then I would join the result set with itself on the requester and on the adjacent position, and take the difference.
Let me try to write the query without having the proper data:
-- Position 1 is each requester's most recent access (ORDER BY ... DESC),
-- so X2 (Position - 1) is the access that came right after X1.
SELECT X1.yg_requester_id,
       DATEDIFF(SECOND, X1.NewDateTimeField, X2.NewDateTimeField) AS TimeDifferenceInSeconds,
       X1.referent_id AS FormerArticle,
       X2.referent_id AS NewArticle
FROM
(
    SELECT ROW_NUMBER() OVER(PARTITION BY yg_requester_id ORDER BY NewDateTimeField DESC) AS Position, NewDateTimeField, yg_requester_id, referent_id
    FROM YourTable
) X1
INNER JOIN
(
    SELECT ROW_NUMBER() OVER(PARTITION BY yg_requester_id ORDER BY NewDateTimeField DESC) AS Position, NewDateTimeField, yg_requester_id, referent_id
    FROM YourTable
) X2 ON X2.yg_requester_id = X1.yg_requester_id AND X2.Position = X1.Position - 1
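Since there was no data to test against, here is a hedged way to try the same ROW_NUMBER self-join idea on a few rows from the question. It uses Python's built-in sqlite3 module as a stand-in database, so the SQL is SQLite dialect rather than the DATEDIFF syntax above, and it needs SQLite 3.25 or newer for window functions. The table name `weblog`, the column `access_time` and the helper `to_iso` are assumptions made for the sketch. It numbers rows in ascending order, so the later row is `pos + 1`.

```python
import sqlite3

# (accessid, date_time_in_seconds, yg_requester_id, referent_id) from the question
ROWS = [
    (2449, 2009011621830, 32276, 12648),
    (2776, 2009011622726, 76360, 11070),
    (2804, 2009011622783, 32276, 13845),
    (2894, 2009011623025, 32276, 7222),
]

def to_iso(packed):
    """Turn YYYYMMDDSSSSS into 'YYYY-MM-DD HH:MM:SS', which SQLite date functions accept."""
    date_part, secs = divmod(packed, 100_000)
    year, month_day = divmod(date_part, 10_000)
    month, day = divmod(month_day, 100)
    hours, rem = divmod(secs, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{year:04d}-{month:02d}-{day:02d} {hours:02d}:{minutes:02d}:{seconds:02d}"

QUERY = """
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY yg_requester_id ORDER BY access_time) AS pos,
           access_time, yg_requester_id, referent_id
    FROM weblog
)
SELECT earlier.yg_requester_id,
       earlier.referent_id AS former_article,
       later.referent_id   AS new_article,
       CAST(strftime('%s', later.access_time) AS INTEGER)
         - CAST(strftime('%s', earlier.access_time) AS INTEGER) AS seconds_read
FROM numbered AS earlier
JOIN numbered AS later
  ON later.yg_requester_id = earlier.yg_requester_id
 AND later.pos = earlier.pos + 1
ORDER BY earlier.yg_requester_id, earlier.pos
"""

def run():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weblog (accessid INTEGER, access_time TEXT, "
        "yg_requester_id INTEGER, referent_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO weblog VALUES (?, ?, ?, ?)",
        [(a, to_iso(t), u, r) for a, t, u, r in ROWS],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

for row in run():
    print(row)
```

Requester 76360 has only one access in this sample, so the inner join produces no row for them. Like the query above, this pairs each access with the very next one and does not skip repeated hits on the same referent_id; that would need an extra filter.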
Upvotes: 2