aseipel

Reputation: 738

Filter groups using a condition for each group in R

I have the following two data frames of user events:

data.favorite (user favorited item at time)

           user   item   time   event
1             1      A      2     fav
2             1      B      6     fav
3             2      D      9     fav
4             3      A      5     fav

data.view (user viewed item at time)

           user   item   time   event
1             1      A      1    view
2             1      A      3    view
3             1      B      4    view
4             1      B      5    view
5             1      B      7    view
6             1      C      8    view
7             3      A      2    view
8             3      A      9    view

I now want to keep only those events in data.view that occurred after that user favorited that item. For example, row 1 of data.view would be removed, because user 1 favorited item A at time 2 and the view happened at time 1. The view event at time 3, however, would remain, as the user had already favorited the item at that point. So the result for this example should look like this:

           user   item   time   event
1             1      A      3    view
2             1      B      7    view
3             3      A      9    view
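
For reproducibility, the tables above can be rebuilt as data frames along these lines. This is only a sketch: the column types (numeric user and time, character item and event) are an assumption, since only the printed values are shown.

```r
# Sample data transcribed from the tables above.
# Column types are assumed; the printed tables do not show them.
data.favorite <- data.frame(
  user  = c(1, 1, 2, 3),
  item  = c("A", "B", "D", "A"),
  time  = c(2, 6, 9, 5),
  event = "fav",
  stringsAsFactors = FALSE
)

data.view <- data.frame(
  user  = c(1, 1, 1, 1, 1, 1, 3, 3),
  item  = c("A", "A", "B", "B", "B", "C", "A", "A"),
  time  = c(1, 3, 4, 5, 7, 8, 2, 9),
  event = "view",
  stringsAsFactors = FALSE
)
```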

My current approach is way too slow. I apply a custom function to data.view:

wasFav = function(u, i, t) {
  favs = data.favorite %>% filter(user == u, item == i, time < t)
  return(nrow(favs) > 0)
}

Any ideas for a faster approach?

Upvotes: 1

Views: 3865

Answers (3)

Janna Maas

Reputation: 1134

I would join by user and item, assuming that every user-item pair occurs only once in data.favorite. You can then directly compare the view time with the time the item was favorited and keep only the rows where time_view > time_fav. Views with no matching favorite get an NA time_fav from the left join, and filter drops those rows as well:

data.view %>%
  left_join(data.favorite, by = c("user", "item"), suffix = c("_view", "_fav")) %>%
  filter(time_view > time_fav)

ETA: That was before I learned about the 'non-equi joins' @Henrik mentions in the comments. Those sound cool.
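
For illustration, here is a minimal sketch of what a non-equi join could look like in dplyr. It assumes dplyr 1.1.0 or later, which introduced join_by(), and it is not necessarily the approach from the comments. The sample data is transcribed from the question, and the name `result` is made up for this example.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

# Sample data transcribed from the question
data.favorite <- data.frame(
  user = c(1, 1, 2, 3), item = c("A", "B", "D", "A"),
  time = c(2, 6, 9, 5), event = "fav"
)
data.view <- data.frame(
  user = c(1, 1, 1, 1, 1, 1, 3, 3),
  item = c("A", "A", "B", "B", "B", "C", "A", "A"),
  time = c(1, 3, 4, 5, 7, 8, 2, 9), event = "view"
)

# Match on user and item, and require view time > fav time.
# In join_by(), the left side of `>` refers to data.view and the
# right side to data.favorite.
result <- data.view %>%
  inner_join(data.favorite,
             by = join_by(user, item, time > time),
             suffix = c("_view", "_fav")) %>%
  select(user, item, time = time_view, event = event_view) %>%
  distinct()  # guards against duplicates if a pair was favorited more than once

result
```

Because the time condition is part of the join itself, no separate filter step is needed, and views without an earlier favorite never appear in the output.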

Upvotes: 1

eipi10

Reputation: 93871

We can combine the two data frames, group by user and item and then keep only event rows in data.view that occur after a fav. We use cumsum to count up instances of fav and select all rows from the first instance of fav onward.

The first set of code is for illustration, so you can see what the method is doing. The second set of code does the filtering directly.

library(tidyverse)

data.favorite %>% bind_rows(data.view) %>%
  arrange(user, item, time) %>%
  group_by(user, item) %>%
  mutate(sequence = cumsum(event=="fav"))

#>     user  item  time event sequence
#> 1      1     A     1  view        0
#> 2      1     A     2   fav        1
#> 3      1     A     3  view        1
#> 4      1     B     4  view        0
#> 5      1     B     5  view        0
#> 6      1     B     6   fav        1
#> 7      1     B     7  view        1
#> 8      1     C     8  view        0
#> 9      2     D     9   fav        1
#> 10     3     A     2  view        0
#> 11     3     A     5   fav        1
#> 12     3     A     9  view        1

data.favorite %>% bind_rows(data.view) %>%
  arrange(user, item, time) %>%
  group_by(user, item) %>%
  filter(cumsum(event=="fav") >= 1, event=="view")

#>    user  item  time event
#> 1     1     A     3  view
#> 2     1     B     7  view
#> 3     3     A     9  view

Upvotes: 1

user3640617

Reputation: 1576

Using match with data.frames called data.view and data.fav:

# Find indices of matching user-item pairs. match() returns only the first
# match, so this assumes each pair occurs at most once in data.fav.
Indices <- match(paste(data.view$user, data.view$item), paste(data.fav$user, data.fav$item))

# Add the corresponding fav time to data.view
data.view$favtime <- data.fav$time[Indices]

# Keep only rows that have a fav time and whose view time is later than it
data.view <- data.view[!is.na(data.view$favtime) & data.view$time > data.view$favtime, ]

Upvotes: 1
