Grouping based on multiple variables, including time-series data

Question

I have a dataframe listing every "event" (term used for an action) carried out on an app. Each event has a user id, time stamp (in format: 'hms' num) and date stamp (in format: Date), among other variables but these are the only variables relevant to my problem. I am trying to group events that occur within 30 minutes of each other per user. For example, 8 rows of events for user 123 would be grouped into 2 rows, representing different sessions of activity on the app.

I have tried using the group_by function but couldn't figure out how to factor in a specific time period.

I don't currently have any code to show as an example of what I have tried.

The following is an example of what my dataframe looks like:

user ID         Event Name         Date         Time

23              Press              01/01/2019   10:20:52
23              Read               01/01/2019   10:21:43
23              Click              01/01/2019   10:27:21
23              Press              01/01/2019   10:28:05
87              Read               01/01/2019   11:42:51
87              Press              01/01/2019   12:16:02
87              Read               01/01/2019   12:17:49
23              Click              01/01/2019   15:42:51
23              Click              01/01/2019   15:43:45
23              Press              01/01/2019   15:45:12
64              Read               01/01/2019   18:01:33
64              Click              01/01/2019   18:02:26
64              Click              01/01/2019   18:02:58
64              Read               01/01/2019   18:04:19
64              Press              01/01/2019   18:10:47

In this example, I've only shown a few events on one day. however the dataframe has thousands of rows with all this information covering about 5 months.

Ideally I am trying to get the end result to look as follows:

user ID         Event Name         Date         Time

23              Session            01/01/2019   10:20:52
87              Session            01/01/2019   11:42:51
87              Session            01/01/2019   12:16:02
23              Session            01/01/2019   15:42:51
64              Session            01/01/2019   18:01:33

So essentially the multiple rows of events that occur within 30 minutes of each other, have been condensed into 1 row where the event name has been renamed as session. Any help would be massively appreciated, as I am still a bit of a newbie with R and this problem seems to be slightly more advanced than I am used to so far.

Benjamin · Accepted Answer

I'd take an iterative approach. The approach below would handle cases where a session (plain-English definition) somehow lasts more than 30 minutes. You'll want to make a cut point after the first 30, and then have the immediate next one start a new "session" (your definition). I couldn't think of how to do that without iterating in this way.

To start with your sample data:

library(tidyverse)
library(lubridate)

events <- tribble(
~`user ID`, ~`Event Name`,        ~Date,      ~Time,
        23,       "Press", "01/01/2019", "10:20:52",
        23,       "Read" , "01/01/2019", "10:21:43",
        23,       "Click", "01/01/2019", "10:27:21",
        23,       "Press", "01/01/2019", "10:28:05",
        87,       "Read" , "01/01/2019", "11:42:51",
        87,       "Press", "01/01/2019", "12:16:02",
        87,       "Read" , "01/01/2019", "12:17:49",
        23,       "Click", "01/01/2019", "15:42:51",
        23,       "Click", "01/01/2019", "15:43:45",
        23,       "Press", "01/01/2019", "15:45:12",
        64,       "Read" , "01/01/2019", "18:01:33",
        64,       "Click", "01/01/2019", "18:02:26",
        64,       "Click", "01/01/2019", "18:02:58",
        64,       "Read" , "01/01/2019", "18:04:19",
        64,       "Press", "01/01/2019", "18:10:47"
)

Then to add a reference row ID and a useful datetime field:

events <- events %>% 
  mutate(
    event_id = row_number(),
    date_time = mdy_hms(paste(Date, Time))
  )

Now we make our first table of sessions, just getting the first session for each user:

sessions <- events %>% 
  group_by(`user ID`) %>% 
  summarise(session_start = min(date_time)) %>% 
  mutate(session_end = session_start + minutes(30))

Then we iterate!

left_join() to find which events are already in a known session.
anti_join() tells us which records are not in that table of known matches.
If there aren't any such unknowns, you're done!
If there are, get those sessions, and add them to the sessions table.

while(TRUE) {
  in_a_known_session <- events %>% 
    left_join(sessions, by = "user ID") %>% 
    filter(date_time >= session_start & date_time < session_end)
  unassigned <- events %>% 
    anti_join(in_a_known_session, by = "event_id")
  if (nrow(unassigned) == 0) {
    break
  }
  sessions <- sessions %>% 
    bind_rows(
      unassigned %>% 
        group_by(`user ID`) %>% 
        summarise(session_start = min(date_time)) %>% 
        mutate(session_end = session_start + minutes(30))
    )
}

Finally, get it in exactly the form you're looking for in your example:

sessions <- sessions %>% 
  arrange(session_start) %>% 
  mutate(
    `Event Name` = "Session",
    Date = format(session_start, "%m/%d/%Y"),
    Time = format(session_start, "%H:%M:%S")
  ) %>% 
  select(-starts_with("session_"))

If this works for you, and you accept it, it'll be my first ever accepted StackOverflow answer! :D

Grouping based on multiple variables, including time-series data

Answers (1)

Related Questions