Faster alternatives to outer

Question

I'm trying to make the following call to outer() substantially faster. Parallelizing via foreach is still prohibitively slow, so I'd like to attempt calling this in C++ using Rcpp but would love to hear any faster alternative.

Given a matrix mat and a list of matrix colnames col.list I am summarizing mat as such.

mycall <- function(mat, col.list) {
  outer(
    rownames(mat),
    col.list,
    Vectorize(function(x,y) {
      mean(mat[x,y])
    })
  )
}

For instance:

set.seed(123)
mat <- matrix(rnorm(100),nrow=10)
rownames(mat) <- letters[1:10]
colnames(mat) <- LETTERS[1:10]
mat
            A          B          C           D           E           F           G          H            I          J
a -0.56047565  1.2240818 -1.0678237  0.42646422 -0.69470698  0.25331851  0.37963948 -0.4910312  0.005764186  0.9935039
b -0.23017749  0.3598138 -0.2179749 -0.29507148 -0.20791728 -0.02854676 -0.50232345 -2.3091689  0.385280401  0.5483970
c  1.55870831  0.4007715 -1.0260044  0.89512566 -1.26539635 -0.04287046 -0.33320738  1.0057385 -0.370660032  0.2387317
d  0.07050839  0.1106827 -0.7288912  0.87813349  2.16895597  1.36860228 -1.01857538 -0.7092008  0.644376549 -0.6279061
e  0.12928774 -0.5558411 -0.6250393  0.82158108  1.20796200 -0.22577099 -1.07179123 -0.6880086 -0.220486562  1.3606524
f  1.71506499  1.7869131 -1.6866933  0.68864025 -1.12310858  1.51647060  0.30352864  1.0255714  0.331781964 -0.6002596
g  0.46091621  0.4978505  0.8377870  0.55391765 -0.40288484 -1.54875280  0.44820978 -0.2847730  1.096839013  2.1873330
h -1.26506123 -1.9666172  0.1533731 -0.06191171 -0.46665535  0.58461375  0.05300423 -1.2207177  0.435181491  1.5326106
i -0.68685285  0.7013559 -1.1381369 -0.30596266  0.77996512  0.12385424  0.92226747  0.1813035 -0.325931586 -0.2357004
j -0.44566197 -0.4727914  1.2538149 -0.38047100 -0.08336907  0.21594157  2.05008469 -0.1388914  1.148807618 -1.0264209

col.list <- replicate(5, sample(colnames(mat),sample(10,1)), simplify = F)
col.list
[[1]]
[1] "I" "H" "F" "C"

[[2]]
[1] "H" "C" "E" "D"

[[3]]
[1] "F" "A" "B" "C"

[[4]]
[1] "I" "G" "H" "F"

[[5]]
[1] "B" "F" "A" "D" "J"


mycall(mat, col.list)
             [,1]        [,2]        [,3]        [,4]        [,5]
 [1,] -0.32494304 -0.45677441 -0.03772476  0.03692275  0.46737855
 [2,] -0.54260254 -0.75753314 -0.02922133 -0.61368967  0.07088301
 [3,] -0.10844910 -0.09763415  0.22265121  0.06475016  0.61009334
 [4,]  0.14372171  0.40224937  0.20522554  0.07130067  0.36000416
 [5,] -0.43982636  0.17912380 -0.31934091 -0.55151435  0.30598183
 [6,]  0.29678266 -0.27389757  0.83293885  0.79433814  1.02136588
 [7,]  0.02527506  0.17601171  0.06195023 -0.07211925  0.43025291
 [8,] -0.01188734 -0.39897791 -0.62342288 -0.03697956 -0.23527315
 [9,] -0.28972770 -0.12070775 -0.24994491  0.22537340 -0.08066115
[10,]  0.61991819  0.16277087  0.13782578  0.81898563 -0.42188074

user13963867 · Accepted Answer

You could try:

sapply(col.list, function(v) rowMeans(mat[, v]))

I suspect the reason your solution is slow is Vectorize: it's a nice way to transform a scalar function into a vectorized function, but it has a huge cost: since it's based on mapply, it will call the function on each element, one by one. That is, one call to mean for each entry. If the outer result is large, that's going to be very costly. Instead, with the solution above, the code is at least vectorized in one direction, thanks to rowMeans.

Faster alternatives to outer

Answers (1)

Related Questions