Reputation: 977
I have a n x m matrix mat
and a vector vec
of length m. Is there a fast way to subtract this vector from each row of the matrix, recycling it as necessary.
Example:
mat = structure(c(3.01, 1.44, 3.31, 1.34, 3.79, 1.65, 3.06, 1.12, 2.34,
0.27, 2.63, 0.63, 2.73, 0.94, 3.1, 1.34, 2.75, 0.75, 2.83, 0.58
), .Dim = c(2L, 10L))
vec = colMeans(mat)
Whats the fastest way to subtract vec
from each row of mat
? aperm
seems very inefficient.
aperm(aperm(mat) - vec)
Allocating a second matrix doesn't seem too spiffy either.
mat - matrix(vec, ncol=ncol(mat), nrow = nrow(mat), by.row =T)
Note: This question is a repost of a previous question closed as a duplicate. Unfortunately the duplicates lacked relevant, thorough answers so I decided to add it along with an answer.
Upvotes: 3
Views: 517
Reputation: 977
library(microbenchmark)
byrow.speed.benchmark = function(ncol, nrow) {
mat = matrix(rnorm(nrow * ncol), nrow = nrow, ncol = ncol)
vec = colSums(mat)
microbenchmark(
aperm(aperm(mat) - vec),
t(t(mat) - vec),
mat - matrix(vec, ncol=ncol(mat), nrow = nrow(mat), byrow =T),
sweep(mat, 2, vec),
mat - rep(vec, each = nrow(mat)),
#mat %*% diag(vec),
mat - vec[col(mat)],
mat - vec,
times = 300
)
}
byrow.speed.benchmark(10, 10)
Comparing several methods of applying across matrix rows we find that allocating a vector is the fastest.
Unit: nanoseconds
expr min lq mean median uq max neval
aperm(aperm(mat) - vec) 8642 9283 10214.287 9923 10243 80344 300
t(t(mat) - vec) 6722 7362 7950.130 8002 8323 27208 300
mat - matrix(vec, ncol = ncol(mat), nrow = nrow(mat), byrow = T) 3201 3841 4282.947 4161 4482 20486 300
sweep(mat, 2, vec) 26888 28489 30016.310 29448 30089 85145 300
mat - rep(vec, each = nrow(mat)) 2560 3201 3481.630 3521 3841 10883 300
mat - vec[col(mat)] 1600 2241 2594.970 2561 2881 6081 300
mat - vec 0 320 389.530 320 321 1921 300
How does this scale?
ncols = floor(10^((4:12)/4))
nrows = floor(10^((4:12)/4))
results = cbind(expand.grid(ncols, nrows), aperm = NA, t=NA, alloc = NA, sweep = NA, rep = NA, indices=NA, control = NA)
for (i in seq(nrow(results))) {
df = byrow.speed.benchmark(results[i,1], results[i,2])
results[i,3:9] = sapply(split(df$time, as.numeric(df$expr)), mean)
}
library(ggplot2)
df = reshape2::melt(results, id.vars= c("Var1", "Var2"))
colnames(df) = c("ncol", "nrow", "method", "meantime")
ggplot(subset(df, ncol==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + ggtitle("Scaling with cell number.") + coord_cartesian(ylim = c(0, 1E6))
ggplot(subset(df, ncol==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + ggtitle("Scaling with cell number.") #+ coord_cartesian(ylim = c(0, 5E7))
ggplot(subset(df, ncol==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + coord_cartesian(ylim = c(0, 3E7)) + ggtitle("Scaling with a wide matrix (1000 columns)")
ggplot(subset(df, nrow==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + coord_cartesian(ylim = c(0, 3E7)) + ggtitle("Scaling with a tall matrix (1000 rows)")
The pink line is the case where we apply the vector over the columns with built in recycling. Allocating a matrix with matrix(vec, byrow=T)
scales the best of our options.
On the off chance that the matrix dimensions affected this here is scaling for a wide and a tall matrix.
Edit: It's worth noting that (as expected) the matrix allocation does not scale as well as vector recycling. The above plots are slightly misleading in that regard.
Upvotes: 3