digestivee
digestivee

Reputation: 740

R - How to create all n-1 long subsets of a vector and save both the remaining vector and the removed vector efficiently?

I am toying around with building a recommender system. I have the historical purchases of certain users.

My data looks as

> head(baskets)
# A tibble: 6 x 2
# Groups:   user_id [2]
  user_id     basket
    <int>     <list>
1       8 <int [21]>
2       8 <int [13]>
3       8 <int [15]>
4      12 <int [22]>
5      12 <int [20]>
6      12 <int [17]>

> baskets$basket[[1]]
 [1]   651  1529  2078  6141  6473  9839 14992 16349 17794 20920 21903 
[12] 23165 23400 24838 28985 32030 34190 39110 39812 44099 49533

Okay so now I want to remove one item from each basket and save it as the target item, and save the rest of the basket as the new basket. This is to be repeated for all items in the basket. If we had for example a user with user_id = 1 and basket = [1,2,3] we would get

user_id   basket   target
      1      2,3        1
      1      1,3        2
      1      1,2        3

How can I construct such a data.frame / tibble in an efficient way? I have a solution but it seems to work quite slow, and since I have a large amount of data I would like to find a better solution if possible.

Currently I have

orderdf <- data.frame(user_id = integer(), basket = list(), target = 
integer())

for(k in 1:dim(baskets)[1]){
  print(k)
  currbasket <- baskets$basket[[k]]
  currbaskets <- lapply(1:length(currbasket), function(i) currbasket[i])
  curruser <- baskets$user_id[k]
  for(j in 1:length(currbaskets)){
    tempdf <- tibble(user_id = baskets$user_id[k], basket = 
                     list(currbaskets[[j]]), target = currbasket[j])
    orderdf <- rbind(orderdf, tempdf)
  }
}

Upvotes: 0

Views: 42

Answers (1)

Emmanuel-Lin
Emmanuel-Lin

Reputation: 1943

First I create myself a reproductable dataset

baskets <- data.frame(user_id = 1:10)
for (i in 1:nrow(df)){
  baskets$basket[i] = list(sample(1:100, 3, replace=F))
}
head(baskets)

Next time, please provide a reproductable set!

The next thing to do is to build a function to handle one line:

x = baskets[1,]
x$basket = x$basket[[1]]
require(data.table)
foraline <- function(x){
  n_inbasket <- length(unlist(x$basket))
  result <- data.table(user_id = rep(x$user_id, n_inbasket))
  result$basket <- sapply(1:n_inbasket, function(i){list(unlist(x$basket)[-i])})
  result$target <- x$basket
  return(result)
}
foraline(x)

Ok and now, we apply it on all lines and reduce it in one data.frame using rbindlist from data.table package.

require(data.table)
order_basket <- rbindlist(apply(baskets, 1, foraline))
head(order_basket)

Upvotes: 1

Related Questions