I have two vectors, A and B . For every element in A I want to find the index of the first element in B that is greater and has higher index. The length of A and B are the same. So for vectors: A <- c(10, 5, 3, 4, 7) B <- c(4, 8, 11, 1, 5) I want a result vector: R <- c(3, 3, 5, 5, NA) Of course I can do it with two loops, but it's very slow, and I don't know how to use apply() in this situation, when the indices matter. My data set has vectors of length 20000, so the speed is really important in this case. A few bonus questions: What if I have a sequence of numbers (like seq = 2:10 ), and I want to find the first number in B that is higher than a+s for every a of A and every s of seq. Like with question 1), but I want to know the first greater, and the first lower value, and create a matrix, which stores which one was first. So for example I have a of A , and 10 from seq . I want to find the first value of B , which is higher than a+10 , or lower than a-10 , and then store it's index and value.

Reputation: 395

Find first greater element with higher index

I have two vectors, A and B. For every element in A I want to find the index of the first element in B that is greater and has higher index. The length of A and B are the same.

So for vectors:

A <- c(10, 5, 3, 4, 7)

B <- c(4, 8, 11, 1, 5)

I want a result vector:

R <- c(3, 3, 5, 5, NA)

Of course I can do it with two loops, but it's very slow, and I don't know how to use apply() in this situation, when the indices matter. My data set has vectors of length 20000, so the speed is really important in this case.

A few bonus questions:

What if I have a sequence of numbers (like seq = 2:10), and I want to find the first number in B that is higher than a+s for every a of A and every s of seq.
Like with question 1), but I want to know the first greater, and the first lower value, and create a matrix, which stores which one was first. So for example I have a of A, and 10 from seq. I want to find the first value of B, which is higher than a+10, or lower than a-10, and then store it's index and value.

Upvotes: 7

Answers (2)

Ricardo Saporta

Reputation: 55390

This is a great example of when sapply is less efficient than loops. Although the sapply does make the code look neater, you are paying for that neatness with time.

Instead you can wrap a while loop inside a for loop inside a nice, neat function.

Here are benchmarks comparing a nested-apply loop against nested for-while loop (and a mixed apply-while loop, for good measure). Update: added the vapply..match.. mentioned in comments. Faster than sapply, but still much slower than while loop.

BENCHMARK:

           test elapsed relative
1     for.while   0.069    1.000
2  sapply.while   0.080    1.159
3  vapply.match   0.101    1.464
4 nested.sapply   0.104    1.507

Notice you save a third of your time; The savings will likely be larger when you start adding the sequences to A.

For the second part of your question:

If you have this all wrapped up in an nice function, it is easy to add a seq to A

# Sample data
A <- c(10, 5, 3, 4, 7, 100, 2)
B <- c(4, 8, 11, 1, 5, 18, 20)

# Sample sequence
S <- seq(1, 12, 3)

# marix with all index values (with names cleaned up)   
indexesOfB <- t(sapply(S, function(s) findIndx(A+s, B)))
dimnames(indexesOfB) <- list(S, A)

Lastly, if you want to instead find values of B less than A, just swap the operation in the function.
(You could include an if-clause in the function and use only a single function. I find it more efficient to have two separate functions)

findIndx.gt(A, B)   #  [1]  3  3  5  5  6 NA  8 NA NA
findIndx.lt(A, B)   #  [1]  2  4  4 NA  8  7 NA NA NA

Then you can wrap it up in one nice pacakge

rangeFindIndx(A, B, S)
 #     A   S  indxB.gt indxB.lt
 #    10   1        3        2
 #     5   1        3        4
 #     3   1        5        4
 #     4   1        5       NA
 #     7   1        6       NA
 #   100   1       NA       NA
 #     2   1       NA       NA
 #    10   4        6        4
 #     5   4        3        4
 #   ...

FUNCTIONS

(Notice they depend on reshape2)

rangeFindIndx <- function(A, B, S) {
  # For each s in S, and for each a in A,
  # find the first value of B, which is higher than a+s, or lower than a-s

  require(reshape2)

  # Create gt & lt matricies;  add dimnames for melting function
  indexesOfB.gt <- sapply(S, function(s) findIndx.gt(A+s, B))
  indexesOfB.lt <- sapply(S, function(s) findIndx.lt(A-s, B))
  dimnames(indexesOfB.gt) <- dimnames(indexesOfB.gt) <- list(A, S)

  # melt the matricies and combine into one
  gtltMatrix <- cbind(melt(indexesOfB.gt), melt(indexesOfB.lt)$value)

  # clean up their names
  names(gtltMatrix) <- c("A", "S", "indxB.gt", "indxB.lt")

  return(gtltMatrix)
}

findIndx.gt <- function(A, B) {
  lng <- length(A)
  ret <- integer(0)
  b <- NULL
  for (j in seq(lng-1)) {
    i <- j + 1
    while (i <= lng && ((b <- B[[i]]) < A[[j]]) ) {
      i <- i + 1
    }
    ret <- c(ret, ifelse(i<lng, i, NA))
  }
  c(ret, NA)  
}

findIndx.lt <- function(A, B) {
  lng <- length(A)
  ret <- integer(0)
  b <- NULL
  for (j in seq(lng-1)) {
    i <- j + 1
    while (i <= lng && ((b <- B[[i]]) > A[[j]]) ) {   # this line contains the only difference from findIndx.gt
      i <- i + 1
    }
    ret <- c(ret, ifelse(i<lng, i, NA))
  }
  c(ret, NA)  
}

Upvotes: 6

James

Reputation: 66844

sapply(sapply(seq_along(a),function(x) which(b[-seq(x)]>a[x])+x),"[",1)
[1]  3  3  5  5 NA