Reputation: 4194
I want to find the fastest way to calculate how many times a subset of
cols defined by vec
from a matrix of logicals are all TRUE
:
mlgl <- structure(c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE,
FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE), .Dim = c(15L, 5L
), .Dimnames = list(NULL, c("l1", "l2", "l3", "l4", "l5")))
mlgl
#> l1 l2 l3 l4 l5
#> [1,] FALSE FALSE FALSE FALSE FALSE
#> [2,] TRUE FALSE FALSE FALSE TRUE
#> [3,] FALSE TRUE FALSE FALSE FALSE
#> [4,] FALSE FALSE TRUE FALSE FALSE
#> [5,] TRUE TRUE FALSE FALSE TRUE
#> [6,] TRUE FALSE TRUE FALSE TRUE
#> [7,] TRUE FALSE FALSE TRUE TRUE
#> [8,] FALSE TRUE TRUE FALSE FALSE
#> [9,] FALSE TRUE FALSE TRUE FALSE
#> [10,] FALSE FALSE TRUE TRUE FALSE
#> [11,] TRUE TRUE TRUE FALSE TRUE
#> [12,] TRUE TRUE FALSE TRUE TRUE
#> [13,] TRUE FALSE TRUE TRUE TRUE
#> [14,] FALSE TRUE TRUE TRUE FALSE
#> [15,] TRUE TRUE TRUE TRUE TRUE
And a vector to subset by, defined by vec
:
vec <- c("l1", "l3")
I want to know how many times all the variables in vec
are TRUE
. For
this vec the expected answer is 4
(rows 6, 11, 13 and 15). the fastest
way I can think of doing this is:
sum(rowSums(mlgl[,vec]) == length(vec))
#> [1] 4
compiler::cpmfun did not help with either of these:
microbenchmark(
sum(apply(mlgl[, vec], 1, all)),
sum(rowSums(mlgl[,vec]) == length(vec)),
unit = "eps"
)
#> Unit: evaluations per second
#> expr min lq mean
#> sum(apply(mlgl[, vec], 1, all)) 4416.649 14013.85 13696.17
#> sum(rowSums(mlgl[, vec]) == length(vec)) 27348.557 63477.96 67712.96
#> median uq max neval cld
#> 14210.30 14397.81 14766.03 100 a
#> 65017.46 75503.08 81175.42 100 b
I would love some alternative solutions or suggestions to do better than this in R or with RCpp.
Update: added some solutions that help out as an answer... another order of magnitude would be nice though.
Upvotes: 3
Views: 113
Reputation: 4194
mlgl <- structure(c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE,
FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE), .Dim = c(15L, 5L
), .Dimnames = list(NULL, c("l1", "l2", "l3", "l4", "l5")))
vec <- c("l1", "l3")
initial sol'n
initial <- function() {
sum(rowSums(mlgl[,vec]) == length(vec))
}
.Internal(...
sol'n (which is not really allowed)
current <- function() {
sml <- mlgl[,vec]
dims <- dim(sml)
sum(.Internal(rowSums(sml, dims[1], dims[2], FALSE)) == dims[2])
}
Thus my attempt at a simple solution with c++:
Rcpp::cppFunction('int cpp_sum_trues(LogicalMatrix x) {
int nrow = x.nrow(), ncol = x.ncol();
int out = 0;
for (int i = 0; i < nrow; i++) {
int total = 0;
for (int j = 0; j < ncol; j++) {
total += x(i, j);
}
if (total == ncol) {
out += 1;
}
}
return out;
}')
a_cpp_soln <- function() {
sml <- mlgl[,vec]
cpp_sum_trues(sml)
}
timings:
microbenchmark(initial(), current(), a_cpp_soln(), times = 1e3, unit = "eps")
#> Unit: evaluations per second
#> expr min lq mean median uq max
#> initial() 13468.01 69223.31 70388.61 71622.98 74239.05 81652.65
#> current() 22163.12 161407.47 168268.59 169319.34 180619.56 211595.43
#> a_cpp_soln() 28041.84 140007.02 151792.51 152288.15 167841.56 186950.83
#> neval cld
#> 1000 a
#> 1000 c
#> 1000 b
Upvotes: 3
Reputation: 99351
It's possible that we can improve speed here by using an integer vector to select the columns, instead of a character vector. With this method, there is no name matching or use of any attributes happening behind the scenes. We'll give fmatch()
and match()
a try.
The following line marked integer
just shows how quick using an integer vector alone can be.
library(fastmatch)
microbenchmark(
fmatch = sum(rowSums(mlgl[, fmatch(vec, colnames(mlgl))]) == length(vec)),
match = sum(rowSums(mlgl[, match(vec, colnames(mlgl))]) == length(vec)),
integer = sum(rowSums(mlgl[, c(1L, 3L)]) == length(vec)),
unit = "eps"
)
# Unit: evaluations per second
# expr min lq mean median uq max neval
# fmatch 16146.74 49468.25 50143.24 50823.34 52064.45 54404.00 100
# match 45108.03 58503.55 59741.99 59724.68 61135.91 64930.85 100
# integer 41023.96 80411.72 81827.19 83004.78 85429.93 88944.23 100
Actually is seems we didn't need to load fastmatch after all, as match()
did better. Overall, using an integer vector instead of character name-matching definitely improves speed here.
I'm confident there will be a nice quick Rcpp answer posted shortly.
Update: Here's one more method using which()
and length()
that is also very good.
microbenchmark(
which = length(which(rowSums(mlgl[, vec]) == length(vec))),
unit = "eps"
)
# Unit: evaluations per second
# expr min lq mean median uq max neval
# which 26816.12 81502.91 81858.62 83156.76 84566.6 87850.3 100
Upvotes: 3