hjms
hjms

Reputation: 273

Vectorizing Operations on Vectors of Different Length

Let's say I have the following two arrays:

R <- 101
v <- array(0, dim <- c(R,2))
v[,1] <-runif(101)
t <- array(runif(5), dim <- c(5,2))

What I would like to do is to assign to each cell in the second column of v the outcome of the following function:

which.min(abs(v[r,1] - t[,1]))

So for each cell in the second column of v, I would have a 1,2,3,4 or 5. I know I can do this using a for loop over all rows r of v, but does someone know a way to vectorize this operation so that I don't have to resort to a (rather slow) for-loop?

Upvotes: 2

Views: 284

Answers (3)

Arun
Arun

Reputation: 118779

I think one could provide a vectorised solution using stepfun and combining with pmin and pmax all of which are vectorised. Its a bit of twisted/complicated logic, but its worth all the effort.

Advantages of using stepfun + pmin + pmax:

  • blazing fast (see benchmarking below)
  • not limited by the size of the matrix (see the error on a huge vector while running Jonathan's code)

First, the idea is inspired from Jonathan Chang's post here. Here the small variation is that you need the index rather than the difference. Also, I assume that all values are positive (from your runif input). You could extend this to vectors with negative inputs, but I leave that task to you if need be. Before I go to the code and benchmarking, let me explain what's the idea behind stepfun.

Assume you have two vectors x (equivalent to v[,1]) and y (equivalent to t[,1]). Now, let us sort y and create a stepfun on the sorted y in this manner:

y_sort <- sort(y)
step <- stepfun(y_sort, 0:length(y))

This helps us how exactly? Querying step(a) gives you the index of the largest value in y_sort that is < a. This might take a while to sink in. In other words, the value a lies in the position between step(a) and step(a) + 1 in the sorted y (y_sort). Now, the first thing we'll have to figure out is, which one of these two values is closest to a. This is achieved by extracting the indices step(a) and step(a)+1 and the values in y_sort corresponding to these indices and asking if the abs(a-y_sort[step(a)]) > abs(a - y_sort[step(a)+1]). If its false, then, step(a) is your index, and vice-versa. Second, getting back the original index from y from y_sort and this can be achieved by obtaining the corresponding sorted indices with the option index.return = TRUE in sort.

I agree this might be quite complicated to follow in this manner. But check the code and run it step by step and use the text above to follow it along (if necessary). The best part is that a can be a vector, so it is extremely fast! Now on to the code.

# vectorised solution using stepfun
vectorise_fun1 <- function(x, y) {
    y_sort <- sort(abs(y), index.return = TRUE)
    y_sval <- y_sort$x
    y_sidx <- y_sort$ix

    # stepfun
    step_fun <- stepfun(y_sval, 0:length(y))
    ix1      <- pmax(1, step_fun(x))
    ix2      <- pmin(length(y), 1 + ix1)
    iy       <- abs(x - y_sval[ix1]) > abs(x - y_sval[ix2])

    # construct output  
    res      <- rep(0, length(x))
    res[iy]  <- y_sidx[ix2[iy]]
    res[!iy] <- y_sidx[ix1[!iy]]
    res
}

# obtaining result
out_arun <- vectorise_fun1(v[,1], t[,1])
# (or) v[,2] <- vectorise_fun1(v[,1], t[,1])

# Are the results identical?
# Matthew's solution
vectorise_fun2 <- function(x, y) {
    res <- Vectorize(function(r) which.min(abs(x[r] - y)))(seq(length(x)))
}
out_matthew <- vectorise_fun2(v[,1], t[,1])

# Jonathan's solution
vectorise_fun3 <- function(x, y) {
    V  <- matrix(rep.int(x, length(y)), ncol = length(y))
    TT <- matrix(rep.int(y, length(x)), ncol = length(y), byrow = T)
    max.col(-abs(V-TT))
}
out_jonathan <- vectorise_fun3(v[,1], t[,1])

# Are the results identical?
> all(out_arun == out_matthew)
[1] TRUE
> all(out_arun == out_jonathan)
[1] TRUE

So, what's the point? All results are identical and the function with stepfun is huge and tricky to follow. Let's take a huge vector.

x <- runif(1e4)
y <- runif(1e3)

Now, let's benchmark to see the advantage:

require(rbenchmark)
> benchmark( out_arun <- vectorise_fun1(x,y), 
             out_matthew <- vectorise_fun2(x,y), 
             out_jonathan <- vectorise_fun3(x,y), 
             replications=1, order = "elapsed")

#                                   test replications elapsed relative user.self
# 1     out_arun <- vectorise_fun1(x, y)            1   0.004     1.00     0.005
# 2  out_matthew <- vectorise_fun2(x, y)            1   0.221    55.25     0.169
# 3 out_jonathan <- vectorise_fun3(x, y)            1   1.381   345.25     0.873

# Are the results identical?
> all(out_arun == out_matthew)
[1] TRUE
> all(out_arun == out_jonathan)
[1] TRUE

So, using step_fun is faster by a min of 55 times and a max of 345 times! Now, let's go for even bigger vectors.

x <- runif(1e5)
y <- runif(1e4)

require(rbenchmark)
> benchmark( out_arun <- vectorise_fun1(x,y), 
             out_matthew <- vectorise_fun2(x,y), 
             replications=1, order = "elapsed")

#                                  test replications elapsed relative user.self
# 1    out_arun <- vectorise_fun1(x, y)            1   0.052    1.000     0.043
# 2 out_matthew <- vectorise_fun2(x, y)            1  16.668  320.538    11.849

Jonathan's function resulted in allocation error:

Error in rep.int(x, length(y)) : 
     cannot allocate vector of length 1000000000

And the speed up is 320 times here.

Upvotes: 1

Jonathan
Jonathan

Reputation: 5278

You can expand v and t:

V <- matrix(rep.int(v[,1],dim(t)[[1]]),ncol=dim(t)[[1]])
TT <- matrix(rep.int(t[,1],dim(v)[[1]]),ncol=dim(t)[[1]],byrow=T)

and then subtract and take the maximum value of each column:

max.col(-abs(V-TT))

Upvotes: 2

Matthew Lundberg
Matthew Lundberg

Reputation: 42629

Not really vectorized despite of the name, as Vectorize calls lapply. But this gives the result:

> Vectorize(function(r) which.min(abs(v[r,1] - t[,1])))(seq(nrow(v)))
##   [1] 4 3 3 2 5 5 2 5 2 5 3 3 2 5 1 4 5 5 4 3 3 5 5 2 4 2 2 4 4 3 2 4 5 2
##  [35] 2 3 2 4 4 1 5 5 2 3 2 4 5 5 3 5 2 4 4 2 4 5 5 5 5 5 4 3 3 5 5 3 2 3
##  [69] 5 3 5 3 3 5 4 5 5 3 1 2 5 5 2 3 3 4 3 3 4 5 4 2 2 3 4 2 5 5 5 5 2

This value can then be assigned to v[,2].

Upvotes: 2

Related Questions