npjc
npjc

Reputation: 4194

What is the fastest way to calculate how many rows in a matrix are all TRUE using R or RCpp?

Synopsis

I want to find the fastest way to calculate how many times a subset of cols defined by vec from a matrix of logicals are all TRUE:

Minimal example:

mlgl <- structure(c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, 
                    FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, 
                    FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, 
                    TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, 
                    FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, 
                    FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, 
                    TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, 
                    FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE), .Dim = c(15L, 5L
                    ), .Dimnames = list(NULL, c("l1", "l2", "l3", "l4", "l5")))
mlgl
#>          l1    l2    l3    l4    l5
#>  [1,] FALSE FALSE FALSE FALSE FALSE
#>  [2,]  TRUE FALSE FALSE FALSE  TRUE
#>  [3,] FALSE  TRUE FALSE FALSE FALSE
#>  [4,] FALSE FALSE  TRUE FALSE FALSE
#>  [5,]  TRUE  TRUE FALSE FALSE  TRUE
#>  [6,]  TRUE FALSE  TRUE FALSE  TRUE
#>  [7,]  TRUE FALSE FALSE  TRUE  TRUE
#>  [8,] FALSE  TRUE  TRUE FALSE FALSE
#>  [9,] FALSE  TRUE FALSE  TRUE FALSE
#> [10,] FALSE FALSE  TRUE  TRUE FALSE
#> [11,]  TRUE  TRUE  TRUE FALSE  TRUE
#> [12,]  TRUE  TRUE FALSE  TRUE  TRUE
#> [13,]  TRUE FALSE  TRUE  TRUE  TRUE
#> [14,] FALSE  TRUE  TRUE  TRUE FALSE
#> [15,]  TRUE  TRUE  TRUE  TRUE  TRUE

And a vector to subset by, defined by vec:

vec <- c("l1", "l3")

I want to know how many times all the variables in vec are TRUE. For this vec the expected answer is 4 (rows 6, 11, 13 and 15). the fastest way I can think of doing this is:

sum(rowSums(mlgl[,vec]) == length(vec))
#> [1] 4

compiler::cpmfun did not help with either of these:

microbenchmark(
  sum(apply(mlgl[, vec], 1, all)), 
  sum(rowSums(mlgl[,vec]) == length(vec)),
  unit = "eps"
  )
#> Unit: evaluations per second
#>                                      expr       min       lq     mean
#>           sum(apply(mlgl[, vec], 1, all))  4416.649 14013.85 13696.17
#>  sum(rowSums(mlgl[, vec]) == length(vec)) 27348.557 63477.96 67712.96
#>    median       uq      max neval cld
#>  14210.30 14397.81 14766.03   100  a 
#>  65017.46 75503.08 81175.42   100   b

I would love some alternative solutions or suggestions to do better than this in R or with RCpp.

Update: added some solutions that help out as an answer... another order of magnitude would be nice though.

Upvotes: 3

Views: 113

Answers (2)

npjc
npjc

Reputation: 4194

Update my current sol'n here:

mlgl <- structure(c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, 
                    FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, 
                    FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, 
                    TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, 
                    FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, 
                    FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, 
                    TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, 
                    FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE), .Dim = c(15L, 5L
                    ), .Dimnames = list(NULL, c("l1", "l2", "l3", "l4", "l5")))
vec <- c("l1", "l3")

initial sol'n

initial <- function() {
  sum(rowSums(mlgl[,vec]) == length(vec))
}

.Internal(... sol'n (which is not really allowed)

current <- function() {
  sml <- mlgl[,vec]
  dims <- dim(sml)
  sum(.Internal(rowSums(sml, dims[1], dims[2], FALSE)) == dims[2])
}

Thus my attempt at a simple solution with c++:

Rcpp::cppFunction('int cpp_sum_trues(LogicalMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  int out = 0;

  for (int i = 0; i < nrow; i++) {
    int total = 0;
    for (int j = 0; j < ncol; j++) {
      total += x(i, j);
    }
    if (total == ncol) {
      out += 1;
    }
  }
  return out;
}')
a_cpp_soln <- function() {
  sml <- mlgl[,vec]
  cpp_sum_trues(sml)
}

timings:

microbenchmark(initial(), current(), a_cpp_soln(), times = 1e3, unit = "eps")
#> Unit: evaluations per second
#>          expr      min        lq      mean    median        uq       max
#>     initial() 13468.01  69223.31  70388.61  71622.98  74239.05  81652.65
#>     current() 22163.12 161407.47 168268.59 169319.34 180619.56 211595.43
#>  a_cpp_soln() 28041.84 140007.02 151792.51 152288.15 167841.56 186950.83
#>  neval cld
#>   1000 a  
#>   1000   c
#>   1000  b

Upvotes: 3

Rich Scriven
Rich Scriven

Reputation: 99351

It's possible that we can improve speed here by using an integer vector to select the columns, instead of a character vector. With this method, there is no name matching or use of any attributes happening behind the scenes. We'll give fmatch() and match() a try.

The following line marked integer just shows how quick using an integer vector alone can be.

library(fastmatch)
microbenchmark(
    fmatch  = sum(rowSums(mlgl[, fmatch(vec, colnames(mlgl))]) == length(vec)),
    match   = sum(rowSums(mlgl[, match(vec, colnames(mlgl))]) == length(vec)),
    integer = sum(rowSums(mlgl[, c(1L, 3L)]) == length(vec)),
    unit = "eps"
 )
# Unit: evaluations per second
#     expr      min       lq     mean   median       uq      max neval
#   fmatch 16146.74 49468.25 50143.24 50823.34 52064.45 54404.00   100
#    match 45108.03 58503.55 59741.99 59724.68 61135.91 64930.85   100
#  integer 41023.96 80411.72 81827.19 83004.78 85429.93 88944.23   100

Actually is seems we didn't need to load fastmatch after all, as match() did better. Overall, using an integer vector instead of character name-matching definitely improves speed here.

I'm confident there will be a nice quick Rcpp answer posted shortly.

Update: Here's one more method using which() and length() that is also very good.

microbenchmark(
    which = length(which(rowSums(mlgl[, vec]) == length(vec))),
    unit = "eps"
)
# Unit: evaluations per second
#   expr      min       lq     mean   median      uq     max neval
#  which 26816.12 81502.91 81858.62 83156.76 84566.6 87850.3   100

Upvotes: 3

Related Questions