Reputation: 1
My task was to open trial cluster Redshift and to upload there data from csv file.
In csv file was given information about id of a user, what page he has visited and time of visit up to milliseconds.
The task is to find all "sessions" where user have done the next actions in the given order: rooms.homework-showcase, rooms.view.step.content,rooms.lesson.rev.step.content. (He can did it not successively).
By "session" it is meant that time passed between actions no more than one hour.
The output should consists of users id, time of the start of session and time of the end of the session.
Sample of what is given in csv:
57529,rooms.homework-showcase,2017-03-01T00:00:07.710000
57529,rooms.view.step.content,2017-03-01T00:00:10.275000
57529,rooms.view.step.content,2017-03-01T00:00:10.436000
168671,rooms.view.step.content,2017-03-01T00:00:12.035000
168671,rooms.view.step.content,2017-03-01T00:00:50.632000
64788,rooms.view.step.content,2017-03-01T00:01:21.460000
93698,rooms.view.step.content,2017-03-01T00:02:41.963000
205265,rooms.homework-showcase,2017-03-01T00:02:45.241000
205265,rooms.test-showcase,2017-03-01T00:02:57.854000
205265,notes,2017-03-01T00:03:01.016000
What I have written so far:
select *,
case
when timepass IS NULL then ncount
when (timepass>=3600) and lag(timepass<3600) over(order by user_id, datet)
then ncount
else Null
end as startt,
case
when (timepass<3600) and lead(timepass IS NULL) over(order by user_id,
datet) then ncount
when (timepass<3600) and lead(timepass>=3600) over(order by user_id, datet)
then ncount
else Null
end as endt
from
(
select *, row_number() OVER(ORDER BY user_id, datet) as ncount,
case
when page_name = 'rooms.homework-showcase' then 0
when page_name = 'rooms.view.step.content' then 1
when page_name = 'rooms.lesson.rev.step.content' then 2
else 3
end as page_order
from
(
select *, EXTRACT(EPOCH from datet) - lag(EXTRACT(EPOCH from datet))
over(partition by user_id order by user_id, datet) as timepass from
(
SELECT *, to_timestamp(date_time, 'YYYY-MM-DD HH:MI:SS:US') as datet from
testtbl
order by date_time asc
) as t1
group by user_id, page_name, date_time, datet
order by user_id, datet asc
) as t3
) as t4
;
I can figure out the beginnig of the session and the end (except I don't know how to include the last row as the end of session). One user can have several sessions.
How can I query the right sessions and take from them user_id, start and end of session? Could you give me some ideas? I don't have enough knowledge in SQL to see the way of writing the right query.
P.S. I've attached how output should look like. This shows particular user and his one session which falls under given criteria. And expected output is a last picture. And there should be listed all of "good" sessions.
The csv file is under this link https://drive.google.com/open?id=0B288Ep6mrn6fWlQ0a3doV0hjQjA
DB Fiddle https://www.db-fiddle.com/f/bz92LHiTW5RrFBUkaH2KtR/0#&togetherjs=m6SnA9xLSS
Upvotes: 0
Views: 174
Reputation: 145
I have run for the sample data you provided using below code through self join. Below is the code:
select user_id,session_begin,max(session_end) as session_end from
(select user_id,session_begin,case when session_end is null then session_begin else session_end end as session_end from
(select a.user_id as user_id,a.datevisited,a.date_time as session_begin,a.session_hour,case when (b.date_time>a.date_time and a.session_hour>=b.date_time)then b.date_time else null end as session_end from
(select user_id,trunc(date_time) as datevisited,date_time,dateadd(hour,1,date_time)session_hour from pages ) a
left join (select user_id,trunc(date_time) as datevisited,date_time,dateadd(hour,1,date_time)session_hour from pages ) b
on a.user_id=b.user_id and a.datevisited=b.datevisited
group by 1,2,3,4,5)
group by 1,2,3)final
group by 1,2;
Test it at your end. For the various functions used in the code you can refer Redshift documentions: https://docs.aws.amazon.com/redshift/latest/dg/r_DATEADD_function.html https://docs.aws.amazon.com/redshift/latest/dg/r_DATEDIFF_function.html
Upvotes: 0