user2845095
user2845095

Reputation: 515

Dealing with repetitive events in clickstream data in R

I have a dataset that looks like this:

Session UserID Duration(min) id Page#
123 345 40 1 Pg1
123 345 40 2 Pg2
123 345 40 3 Pg2
123 345 40 4 Pg3
123 345 40 5 Pg2
123 345 40 6 Pg3
124 346 100 1 Pg2
124 346 100 2 Pg2
124 346 100 3 Pg3
124 346 100 4 Pg4
125 347 59 1 Pg1
125 347 59 2 Pg5

I would like the dataset to look like this (below) because it would eventually create a more meaningful user journey.

Session UserID Duration(min) id Page#
123 345 40 1 Pg1
123 345 40 2 Pg2
123 345 40 3 Pg3
123 345 40 4 Pg2
123 345 40 5 Pg3
124 346 100 1 Pg2
124 346 100 2 Pg3
124 346 100 3 Pg4
125 347 59 1 Pg1
125 347 59 2 Pg5

Duration (min) = the number of minutes from the beginning of the path to the last page viewed. id = path number or path length number.

I am trying to take out subsequent repetitive events based on the session and user id, leaving the duration as is.

Upvotes: -1

Views: 56

Answers (1)

Caspar V.
Caspar V.

Reputation: 1812

library('dplyr')

df %>%
  
  # group rows by Session + UserID
  group_by(Session,UserID) %>%
  
  # filter rows where Page is different from the last Page (within group), or where last Page (within group) is NA
  filter(Page != lag(Page, 1) | is.na(lag(Page, 1)) ) %>%
  
  # create new ids
  mutate(id = row_number()) %>%
  
  # remove grouping, for future use
  ungroup()

Result:

# A tibble: 10 × 5
   Session UserID Duration.min.    id Page 
     <int>  <int>         <int> <int> <chr>
 1     123    345            40     1 Pg1  
 2     123    345            40     2 Pg2  
 3     123    345            40     3 Pg3  
 4     123    345            40     4 Pg2  
 5     123    345            40     5 Pg3  
 6     124    346           100     1 Pg2  
 7     124    346           100     2 Pg3  
 8     124    346           100     3 Pg4  
 9     125    347            59     1 Pg1  
10     125    347            59     2 Pg5  

Upvotes: 1

Related Questions