Reputation: 13
I have a dataframe listing every "event" (term used for an action) carried out on an app. Each event has a user id, time stamp (in format: 'hms' num) and date stamp (in format: Date), among other variables but these are the only variables relevant to my problem. I am trying to group events that occur within 30 minutes of each other per user. For example, 8 rows of events for user 123 would be grouped into 2 rows, representing different sessions of activity on the app.
I have tried using the group_by function but couldn't figure out how to factor in a specific time period.
I don't currently have any code to show as an example of what I have tried.
The following is an example of what my dataframe looks like:
user ID Event Name Date Time
23 Press 01/01/2019 10:20:52
23 Read 01/01/2019 10:21:43
23 Click 01/01/2019 10:27:21
23 Press 01/01/2019 10:28:05
87 Read 01/01/2019 11:42:51
87 Press 01/01/2019 12:16:02
87 Read 01/01/2019 12:17:49
23 Click 01/01/2019 15:42:51
23 Click 01/01/2019 15:43:45
23 Press 01/01/2019 15:45:12
64 Read 01/01/2019 18:01:33
64 Click 01/01/2019 18:02:26
64 Click 01/01/2019 18:02:58
64 Read 01/01/2019 18:04:19
64 Press 01/01/2019 18:10:47
In this example, I've only shown a few events on one day. however the dataframe has thousands of rows with all this information covering about 5 months.
Ideally I am trying to get the end result to look as follows:
user ID Event Name Date Time
23 Session 01/01/2019 10:20:52
87 Session 01/01/2019 11:42:51
87 Session 01/01/2019 12:16:02
23 Session 01/01/2019 15:42:51
64 Session 01/01/2019 18:01:33
So essentially the multiple rows of events that occur within 30 minutes of each other, have been condensed into 1 row where the event name has been renamed as session. Any help would be massively appreciated, as I am still a bit of a newbie with R and this problem seems to be slightly more advanced than I am used to so far.
Upvotes: 0
Views: 619
Reputation: 896
I'd take an iterative approach. The approach below would handle cases where a session (plain-English definition) somehow lasts more than 30 minutes. You'll want to make a cut point after the first 30, and then have the immediate next one start a new "session" (your definition). I couldn't think of how to do that without iterating in this way.
To start with your sample data:
library(tidyverse)
library(lubridate)
events <- tribble(
~`user ID`, ~`Event Name`, ~Date, ~Time,
23, "Press", "01/01/2019", "10:20:52",
23, "Read" , "01/01/2019", "10:21:43",
23, "Click", "01/01/2019", "10:27:21",
23, "Press", "01/01/2019", "10:28:05",
87, "Read" , "01/01/2019", "11:42:51",
87, "Press", "01/01/2019", "12:16:02",
87, "Read" , "01/01/2019", "12:17:49",
23, "Click", "01/01/2019", "15:42:51",
23, "Click", "01/01/2019", "15:43:45",
23, "Press", "01/01/2019", "15:45:12",
64, "Read" , "01/01/2019", "18:01:33",
64, "Click", "01/01/2019", "18:02:26",
64, "Click", "01/01/2019", "18:02:58",
64, "Read" , "01/01/2019", "18:04:19",
64, "Press", "01/01/2019", "18:10:47"
)
Then to add a reference row ID and a useful datetime field:
events <- events %>%
mutate(
event_id = row_number(),
date_time = mdy_hms(paste(Date, Time))
)
Now we make our first table of sessions, just getting the first session for each user:
sessions <- events %>%
group_by(`user ID`) %>%
summarise(session_start = min(date_time)) %>%
mutate(session_end = session_start + minutes(30))
Then we iterate!
left_join()
to find which events are already in a known session.anti_join()
tells us which records are not in that table of known matches.sessions
table.while(TRUE) {
in_a_known_session <- events %>%
left_join(sessions, by = "user ID") %>%
filter(date_time >= session_start & date_time < session_end)
unassigned <- events %>%
anti_join(in_a_known_session, by = "event_id")
if (nrow(unassigned) == 0) {
break
}
sessions <- sessions %>%
bind_rows(
unassigned %>%
group_by(`user ID`) %>%
summarise(session_start = min(date_time)) %>%
mutate(session_end = session_start + minutes(30))
)
}
Finally, get it in exactly the form you're looking for in your example:
sessions <- sessions %>%
arrange(session_start) %>%
mutate(
`Event Name` = "Session",
Date = format(session_start, "%m/%d/%Y"),
Time = format(session_start, "%H:%M:%S")
) %>%
select(-starts_with("session_"))
If this works for you, and you accept it, it'll be my first ever accepted StackOverflow answer! :D
Upvotes: 3