nate
nate

Reputation: 977

Fastest way to recycle vector along matrix rows

I have a n x m matrix mat and a vector vec of length m. Is there a fast way to subtract this vector from each row of the matrix, recycling it as necessary.

Example:

mat = structure(c(3.01, 1.44, 3.31, 1.34, 3.79, 1.65, 3.06, 1.12, 2.34, 
0.27, 2.63, 0.63, 2.73, 0.94, 3.1, 1.34, 2.75, 0.75, 2.83, 0.58
), .Dim = c(2L, 10L))

vec = colMeans(mat)

Whats the fastest way to subtract vec from each row of mat? aperm seems very inefficient.

aperm(aperm(mat) - vec)

Allocating a second matrix doesn't seem too spiffy either.

mat - matrix(vec, ncol=ncol(mat), nrow = nrow(mat), by.row =T)

Note: This question is a repost of a previous question closed as a duplicate. Unfortunately the duplicates lacked relevant, thorough answers so I decided to add it along with an answer.

Upvotes: 3

Views: 517

Answers (1)

nate
nate

Reputation: 977

library(microbenchmark)

byrow.speed.benchmark = function(ncol, nrow) {
  mat = matrix(rnorm(nrow * ncol), nrow = nrow, ncol = ncol)
  vec = colSums(mat)


  microbenchmark(
    aperm(aperm(mat) - vec),
    t(t(mat) - vec),
    mat - matrix(vec, ncol=ncol(mat), nrow = nrow(mat), byrow =T),
    sweep(mat, 2, vec),
    mat - rep(vec, each = nrow(mat)),
    #mat %*% diag(vec),
    mat - vec[col(mat)],
    mat - vec,
    times = 300
  )
}


byrow.speed.benchmark(10, 10)

Comparing several methods of applying across matrix rows we find that allocating a vector is the fastest.

Unit: nanoseconds
                                                             expr   min    lq      mean median    uq   max neval
                                          aperm(aperm(mat) - vec)  8642  9283 10214.287   9923 10243 80344   300
                                                  t(t(mat) - vec)  6722  7362  7950.130   8002  8323 27208   300
 mat - matrix(vec, ncol = ncol(mat), nrow = nrow(mat), byrow = T)  3201  3841  4282.947   4161  4482 20486   300
                                               sweep(mat, 2, vec) 26888 28489 30016.310  29448 30089 85145   300
                                 mat - rep(vec, each = nrow(mat))  2560  3201  3481.630   3521  3841 10883   300
                                              mat - vec[col(mat)]  1600  2241  2594.970   2561  2881  6081   300
                                                        mat - vec     0   320   389.530    320   321  1921   300

How does this scale?

ncols = floor(10^((4:12)/4))
nrows = floor(10^((4:12)/4))

results = cbind(expand.grid(ncols, nrows), aperm = NA, t=NA, alloc = NA, sweep = NA, rep = NA,  indices=NA, control = NA)

for (i in seq(nrow(results))) {
  df = byrow.speed.benchmark(results[i,1], results[i,2])

  results[i,3:9] = sapply(split(df$time, as.numeric(df$expr)), mean)
}

library(ggplot2)

df = reshape2::melt(results, id.vars= c("Var1", "Var2"))

colnames(df) = c("ncol", "nrow", "method", "meantime")



ggplot(subset(df, ncol==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method))  + ggtitle("Scaling with cell number.") + coord_cartesian(ylim = c(0, 1E6))
ggplot(subset(df, ncol==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method))  + ggtitle("Scaling with cell number.") #+ coord_cartesian(ylim = c(0, 5E7))


ggplot(subset(df, ncol==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + coord_cartesian(ylim = c(0, 3E7)) + ggtitle("Scaling with a wide matrix (1000 columns)")
ggplot(subset(df, nrow==1000)) + geom_point(aes(x = log10(ncol*nrow), y=meantime, colour = method))+ geom_line(aes(x = log10(ncol*nrow), y=meantime, colour = method)) + coord_cartesian(ylim = c(0, 3E7)) + ggtitle("Scaling with a tall matrix (1000 rows)")

The pink line is the case where we apply the vector over the columns with built in recycling. Allocating a matrix with matrix(vec, byrow=T) scales the best of our options.

Scaling with cell number Scaling with cell number, small number of cells

On the off chance that the matrix dimensions affected this here is scaling for a wide and a tall matrix.

Wide matrix Tall matrix

Edit: It's worth noting that (as expected) the matrix allocation does not scale as well as vector recycling. The above plots are slightly misleading in that regard.

Matrix vs recycling

Upvotes: 3

Related Questions