Reputation: 958
I'm trying to speed up a workflow that involves multiplying rows from two data frames together via a custom function.
Right now I'm using apply() with a custom function. My understanding is that lapply() or sapply() would be faster (and eventually allow parallelization, though I'd prefer a speedup that does not depend on parallel processing), but I can't figure out the lapply() or sapply() syntax I should use with my custom function. If there's an even simpler way to vectorize the custom function and avoid *apply() altogether, that would be preferred.
The number of rows in my use case will be 1 million or more, and the number of columns will be around 15, but here's a MWE that illustrates the speed issue:
# Two data frames that will be used in the calculation. d2 can be a matrix, but d1 must be a data frame.
d1 <- data.frame(V1 = runif(1000), V2 = runif(1000), V3 = runif(1000), V4 = runif(1000))
d2 <- data.frame(Va = runif(3), V1 = runif(3), V2 = runif(3), V3 = runif(3), V4 = runif(3))
# Custom function that is applied to each row in d1
manualprob <- function(x){
xb1 <- as.numeric(rowSums(d2[1,2:ncol(d2)] * x) + d2[1,1])
xb2 <- as.numeric(rowSums(d2[2,2:ncol(d2)] * x) + d2[2,1])
xb3 <- as.numeric(rowSums(d2[3,2:ncol(d2)] * x) + d2[3,1])
denom <- 1 + exp(xb1) + exp(xb2) + exp(xb3)
prob <- exp(xb1)/denom
return(prob)
}
# apply() used below, but it is too slow
start_time <- proc.time()
d1$prob <- as.vector(apply(d1, 1, manualprob))
proc.time() - start_time
user system elapsed
1.081 0.007 1.088
Upvotes: 1
Views: 122
Reputation: 18425
Your best bet is to convert to matrices and use R's very fast matrix operations...
You can create all of the xb figures in one go with
xb <- as.matrix(d2[, -1]) %*% t(as.matrix(d1)) + d2[, 1]
This produces a 3*1000 matrix.
And then you can get the probabilities with
prob <- exp(xb[1, ]) / (1 + colSums(exp(xb)))
This all takes almost zero time on my machine!
Upvotes: 3