Aaron
Aaron

Reputation: 109

Faster alternatives to outer

I'm trying to make the following call to outer() substantially faster. Parallelizing via foreach is still prohibitively slow, so I'd like to attempt calling this in C++ using Rcpp but would love to hear any faster alternative.

Given a matrix mat and a list of matrix colnames col.list I am summarizing mat as such.

mycall <- function(mat, col.list) {
  outer(
    rownames(mat),
    col.list,
    Vectorize(function(x,y) {
      mean(mat[x,y])
    })
  )
}

For instance:

set.seed(123)
mat <- matrix(rnorm(100),nrow=10)
rownames(mat) <- letters[1:10]
colnames(mat) <- LETTERS[1:10]
mat
            A          B          C           D           E           F           G          H            I          J
a -0.56047565  1.2240818 -1.0678237  0.42646422 -0.69470698  0.25331851  0.37963948 -0.4910312  0.005764186  0.9935039
b -0.23017749  0.3598138 -0.2179749 -0.29507148 -0.20791728 -0.02854676 -0.50232345 -2.3091689  0.385280401  0.5483970
c  1.55870831  0.4007715 -1.0260044  0.89512566 -1.26539635 -0.04287046 -0.33320738  1.0057385 -0.370660032  0.2387317
d  0.07050839  0.1106827 -0.7288912  0.87813349  2.16895597  1.36860228 -1.01857538 -0.7092008  0.644376549 -0.6279061
e  0.12928774 -0.5558411 -0.6250393  0.82158108  1.20796200 -0.22577099 -1.07179123 -0.6880086 -0.220486562  1.3606524
f  1.71506499  1.7869131 -1.6866933  0.68864025 -1.12310858  1.51647060  0.30352864  1.0255714  0.331781964 -0.6002596
g  0.46091621  0.4978505  0.8377870  0.55391765 -0.40288484 -1.54875280  0.44820978 -0.2847730  1.096839013  2.1873330
h -1.26506123 -1.9666172  0.1533731 -0.06191171 -0.46665535  0.58461375  0.05300423 -1.2207177  0.435181491  1.5326106
i -0.68685285  0.7013559 -1.1381369 -0.30596266  0.77996512  0.12385424  0.92226747  0.1813035 -0.325931586 -0.2357004
j -0.44566197 -0.4727914  1.2538149 -0.38047100 -0.08336907  0.21594157  2.05008469 -0.1388914  1.148807618 -1.0264209

col.list <- replicate(5, sample(colnames(mat),sample(10,1)), simplify = F)
col.list
[[1]]
[1] "I" "H" "F" "C"

[[2]]
[1] "H" "C" "E" "D"

[[3]]
[1] "F" "A" "B" "C"

[[4]]
[1] "I" "G" "H" "F"

[[5]]
[1] "B" "F" "A" "D" "J"


mycall(mat, col.list)
             [,1]        [,2]        [,3]        [,4]        [,5]
 [1,] -0.32494304 -0.45677441 -0.03772476  0.03692275  0.46737855
 [2,] -0.54260254 -0.75753314 -0.02922133 -0.61368967  0.07088301
 [3,] -0.10844910 -0.09763415  0.22265121  0.06475016  0.61009334
 [4,]  0.14372171  0.40224937  0.20522554  0.07130067  0.36000416
 [5,] -0.43982636  0.17912380 -0.31934091 -0.55151435  0.30598183
 [6,]  0.29678266 -0.27389757  0.83293885  0.79433814  1.02136588
 [7,]  0.02527506  0.17601171  0.06195023 -0.07211925  0.43025291
 [8,] -0.01188734 -0.39897791 -0.62342288 -0.03697956 -0.23527315
 [9,] -0.28972770 -0.12070775 -0.24994491  0.22537340 -0.08066115
[10,]  0.61991819  0.16277087  0.13782578  0.81898563 -0.42188074

Upvotes: 1

Views: 139

Answers (1)

user13963867
user13963867

Reputation:

You could try:

sapply(col.list, function(v) rowMeans(mat[, v]))

I suspect the reason your solution is slow is Vectorize: it's a nice way to transform a scalar function into a vectorized function, but it has a huge cost: since it's based on mapply, it will call the function on each element, one by one. That is, one call to mean for each entry. If the outer result is large, that's going to be very costly. Instead, with the solution above, the code is at least vectorized in one direction, thanks to rowMeans.

Upvotes: 1

Related Questions