andrey
andrey

Reputation: 39

Vectorization while using which() function in R

I have 3 vectors and I want to apply separately on each of them the 'which()' function. I'm trying to find the max index of values less than some given number. How can I operate this task using vectorization?

my 3 vectors (may have various lengths)

vec1 <- c(1,2,3,4,5)
vec2 <- c(11,12,13)
vec3 <- c(1,2,3,4,5,6,7,8)

How can I vectorize it?

max(which(vec1<3))
max(which(vec2<12.3))
max(which(vec3<5.7))

The expected result is:

2
2
5

Upvotes: 2

Views: 123

Answers (2)

josliber
josliber

Reputation: 44340

One way to get a speedup would be to use Rcpp to search for elements smaller than your cutoff, starting from the right side of the vector and moving left. You can return as soon as you find the element that meets your criteria, which means that if your target is near the right side of the vector you might avoid looking at most of the vector's elements (meanwhile, which looks at all vector elements and max looks at all values returned by which). The speedup would be largest for long vectors where the target element is close to the end.

library(Rcpp)
rightmost.small <- cppFunction(
'double rightmostSmall(NumericVector x, const double cutoff) {
  for (int i=x.size()-1; i >= 0; --i) {
    if (x[i] < cutoff) return i+1;  // 1-index
  }
  return 0;  // None found
}')
rightmost.small(vec1, 3)
# [1] 2
rightmost.small(vec2, 12.3)
# [1] 2
rightmost.small(vec3, 5.7)
# [1] 5

Let's look at the performance for a vector where we expect this to give us a big speedup:

set.seed(144)
vec.large <- rnorm(1000000)
all.equal(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# [1] TRUE
library(microbenchmark)
microbenchmark(max(which(vec.large < -1)), rightmost.small(vec.large, -1))
# Unit: microseconds
#                            expr      min       lq        mean    median        uq       max neval
#      max(which(vec.large < -1)) 4912.016 8097.290 12816.36406 9189.0685 9883.9775 60405.585   100
#  rightmost.small(vec.large, -1)    1.643    2.476     8.54274    8.8915   12.8375    58.152   100

For this vector of length 1 million, we see a speedup of about 1000x using the Rcpp code.

This speedup should carry directly over to the case where you have many vectors stored in a list; you can use @JoshO'Brien's mapply code and observe a speedup when you switch from max(which(...)) to the Rcpp code:

f <- function(v,m) max(which(v < m))
l <- list(vec.large)[rep(1, 100)]
m <- rep(-1, 100)
microbenchmark(mapply(f, l, m), mapply(rightmost.small, l, m))
Unit: microseconds
                          expr        min          lq        mean     median          uq         max neval
               mapply(f, l, m) 865287.828 907893.8505 931448.1555 918637.343 935632.0505 1133909.950   100
 mapply(rightmost.small, l, m)    253.573    281.6855    344.5437    303.094    335.1675     712.897   100

We see a 3000x speedup by using the Rcpp code here.

Upvotes: 4

Josh O&#39;Brien
Josh O&#39;Brien

Reputation: 162421

l <- list(vec1,vec2,vec3)
m <- c(3, 12.3, 5.7)
f <- function(v,m) max(which(v < m))
mapply(f,l,m)
# [1] 2 2 5

Upvotes: 2

Related Questions