Reputation: 738
I have the following two data frames of user events:
data.favorite (user favorited item at time)
user item time event
1 1 A 2 fav
2 1 B 6 fav
3 2 D 9 fav
4 3 A 5 fav
data.view (user viewed item at time)
user item time event
1 1 A 1 view
2 1 A 3 view
3 1 B 4 view
4 1 B 5 view
5 1 B 7 view
6 1 C 8 view
7 3 A 2 view
8 3 A 9 view
I now only want to keep those events of data.view that occurred after that user favorited that item. E.g., row 1 of data.view would be removed, because user 1 favorited item A at time 2. The view event at time 3, however, would remain, because the user had already favorited the item at that point. So the result for this example should look like this:
user item time event
1 1 A 3 view
2 1 B 7 view
3 3 A 9 view
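To make the example reproducible, the two data frames could be built roughly like this. It is only a sketch that retypes the tables above; storing user as integer and item and event as character is an assumption, since the printed output does not show the column types.

```r
# Favorite events: each (user, item) pair appears once in this example
data.favorite <- data.frame(
  user  = c(1L, 1L, 2L, 3L),
  item  = c("A", "B", "D", "A"),
  time  = c(2, 6, 9, 5),
  event = "fav",
  stringsAsFactors = FALSE
)

# View events
data.view <- data.frame(
  user  = c(1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L),
  item  = c("A", "A", "B", "B", "B", "C", "A", "A"),
  time  = c(1, 3, 4, 5, 7, 8, 2, 9),
  event = "view",
  stringsAsFactors = FALSE
)
```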
My current approach is way too slow. I apply a custom function to data.view:
wasFav = function(u, i, t) {
  favs = data.favorite %>% filter(user == u, item == i, time < t)
  return(nrow(favs) > 0)
}
Any ideas for a faster approach?
Upvotes: 1
Views: 3865
Reputation: 1134
I would join by user and item, assuming that every user-item pair occurs only once in data.favorite. You can then directly compare the view time with the time the item was favorited and keep only the rows where time_view > time_fav:
data.view %>%
  left_join(data.favorite, by=c("user", "item"), suffix=c("_view","_fav")) %>%
  filter(time_view > time_fav)
ETA: that was before I learned about the 'non-equi joins' @Henrik mentions in the comments above. Those sound cool.
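For reference, a non-equi join might look something like the sketch below. It assumes dplyr 1.1.0 or later, where join_by() accepts inequality conditions; this may not be the exact approach the comment had in mind. The data frames are retyped from the question, and the name result is only illustrative.

```r
library(dplyr)  # join_by() with inequalities needs dplyr >= 1.1.0 (assumption)

# Example data retyped from the question
data.favorite <- data.frame(
  user = c(1L, 1L, 2L, 3L), item = c("A", "B", "D", "A"),
  time = c(2, 6, 9, 5), event = "fav", stringsAsFactors = FALSE
)
data.view <- data.frame(
  user = c(1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L),
  item = c("A", "A", "B", "B", "B", "C", "A", "A"),
  time = c(1, 3, 4, 5, 7, 8, 2, 9), event = "view", stringsAsFactors = FALSE
)

# Match on user and item, and additionally require that the view
# happened strictly after the favorite (x = data.view, y = data.favorite)
result <- data.view %>%
  inner_join(
    data.favorite,
    by = join_by(user, item, x$time > y$time),
    suffix = c("_view", "_fav")
  ) %>%
  # Keep only the view columns, restoring their original names
  select(user, item, time = time_view, event = event_view)
```

If a user could favorite the same item more than once, a view would match several favorites, so a distinct() at the end would probably be needed.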
Upvotes: 1
Reputation: 93871
We can combine the two data frames, group by user and item, and then keep only the event rows in data.view that occur after a fav. We use cumsum to count up instances of fav and select all rows from the first instance of fav onward.
The first set of code is for illustration, so you can see what the method is doing. The second set of code does the filtering directly.
library(tidyverse)
data.favorite %>% bind_rows(data.view) %>%
  arrange(user, item, time) %>%
  group_by(user, item) %>%
  mutate(sequence = cumsum(event=="fav"))
user item time event sequence
1 1 A 1 view 0
2 1 A 2 fav 1
3 1 A 3 view 1
4 1 B 4 view 0
5 1 B 5 view 0
6 1 B 6 fav 1
7 1 B 7 view 1
8 1 C 8 view 0
9 2 D 9 fav 1
10 3 A 2 view 0
11 3 A 5 fav 1
12 3 A 9 view 1
data.favorite %>% bind_rows(data.view) %>%
  arrange(user, item, time) %>%
  group_by(user, item) %>%
  filter(cumsum(event=="fav") >= 1, event=="view")
user item time event
1 1 A 3 view
2 1 B 7 view
3 3 A 9 view
Upvotes: 1
Reputation: 1576
Using match with data.frames called data.view and data.fav:
#Find indices of matching users&items
Indices <- match(paste(data.view$user, data.view$item), paste(data.fav$user, data.fav$item))
#add corresponding fav time to data.view:
data.view$favtime <- data.fav$time[Indices]
#only keep rows in which time is greater than fav.time:
data.view <- data.view[data.view$time>data.view$favtime & !is.na(data.view$favtime),]
Upvotes: 1