Reputation: 273
Let's say I have the following two arrays:
R <- 101
v <- array(0, dim <- c(R,2))
v[,1] <-runif(101)
t <- array(runif(5), dim <- c(5,2))
What I would like to do is to assign to each cell in the second column of v the outcome of the following function:
which.min(abs(v[r,1] - t[,1]))
So for each cell in the second column of v, I would have a 1,2,3,4 or 5. I know I can do this using a for loop over all rows r of v, but does someone know a way to vectorize this operation so that I don't have to resort to a (rather slow) for-loop?
Upvotes: 2
Views: 284
Reputation: 118779
I think one could provide a vectorised solution using stepfun
and combining with pmin
and pmax
all of which are vectorised. Its a bit of twisted/complicated logic, but its worth all the effort.
Advantages of using stepfun
+ pmin
+ pmax
:
First, the idea is inspired from Jonathan Chang's
post here. Here the small variation is that you need the index rather than the difference. Also, I assume that all values are positive (from your runif
input). You could extend this to vectors with negative inputs, but I leave that task to you if need be. Before I go to the code and benchmarking, let me explain what's the idea behind stepfun
.
Assume you have two vectors x
(equivalent to v[,1]
) and y
(equivalent to t[,1]
). Now, let us sort y
and create a stepfun
on the sorted y
in this manner:
y_sort <- sort(y)
step <- stepfun(y_sort, 0:length(y))
This helps us how exactly? Querying step(a)
gives you the index of the largest value in y_sort
that is < a
. This might take a while to sink in. In other words, the value a
lies in the position between step(a)
and step(a) + 1
in the sorted y (y_sort)
. Now, the first thing we'll have to figure out is, which one of these two values is closest to a
. This is achieved by extracting the indices step(a)
and step(a)+1
and the values in y_sort
corresponding to these indices and asking if the abs(a-y_sort[step(a)]) > abs(a - y_sort[step(a)+1])
. If its false, then, step(a)
is your index, and vice-versa. Second, getting back the original index from y
from y_sort
and this can be achieved by obtaining the corresponding sorted indices with the option index.return = TRUE
in sort
.
I agree this might be quite complicated to follow in this manner. But check the code and run it step by step and use the text above to follow it along (if necessary). The best part is that a
can be a vector, so it is extremely fast! Now on to the code.
# vectorised solution using stepfun
vectorise_fun1 <- function(x, y) {
y_sort <- sort(abs(y), index.return = TRUE)
y_sval <- y_sort$x
y_sidx <- y_sort$ix
# stepfun
step_fun <- stepfun(y_sval, 0:length(y))
ix1 <- pmax(1, step_fun(x))
ix2 <- pmin(length(y), 1 + ix1)
iy <- abs(x - y_sval[ix1]) > abs(x - y_sval[ix2])
# construct output
res <- rep(0, length(x))
res[iy] <- y_sidx[ix2[iy]]
res[!iy] <- y_sidx[ix1[!iy]]
res
}
# obtaining result
out_arun <- vectorise_fun1(v[,1], t[,1])
# (or) v[,2] <- vectorise_fun1(v[,1], t[,1])
# Are the results identical?
# Matthew's solution
vectorise_fun2 <- function(x, y) {
res <- Vectorize(function(r) which.min(abs(x[r] - y)))(seq(length(x)))
}
out_matthew <- vectorise_fun2(v[,1], t[,1])
# Jonathan's solution
vectorise_fun3 <- function(x, y) {
V <- matrix(rep.int(x, length(y)), ncol = length(y))
TT <- matrix(rep.int(y, length(x)), ncol = length(y), byrow = T)
max.col(-abs(V-TT))
}
out_jonathan <- vectorise_fun3(v[,1], t[,1])
# Are the results identical?
> all(out_arun == out_matthew)
[1] TRUE
> all(out_arun == out_jonathan)
[1] TRUE
So, what's the point? All results are identical and the function with stepfun
is huge and tricky to follow. Let's take a huge vector.
x <- runif(1e4)
y <- runif(1e3)
Now, let's benchmark to see the advantage:
require(rbenchmark)
> benchmark( out_arun <- vectorise_fun1(x,y),
out_matthew <- vectorise_fun2(x,y),
out_jonathan <- vectorise_fun3(x,y),
replications=1, order = "elapsed")
# test replications elapsed relative user.self
# 1 out_arun <- vectorise_fun1(x, y) 1 0.004 1.00 0.005
# 2 out_matthew <- vectorise_fun2(x, y) 1 0.221 55.25 0.169
# 3 out_jonathan <- vectorise_fun3(x, y) 1 1.381 345.25 0.873
# Are the results identical?
> all(out_arun == out_matthew)
[1] TRUE
> all(out_arun == out_jonathan)
[1] TRUE
So, using step_fun
is faster by a min of 55 times and a max of 345 times! Now, let's go for even bigger vectors.
x <- runif(1e5)
y <- runif(1e4)
require(rbenchmark)
> benchmark( out_arun <- vectorise_fun1(x,y),
out_matthew <- vectorise_fun2(x,y),
replications=1, order = "elapsed")
# test replications elapsed relative user.self
# 1 out_arun <- vectorise_fun1(x, y) 1 0.052 1.000 0.043
# 2 out_matthew <- vectorise_fun2(x, y) 1 16.668 320.538 11.849
Jonathan's function resulted in allocation error:
Error in rep.int(x, length(y)) :
cannot allocate vector of length 1000000000
And the speed up is 320 times here.
Upvotes: 1
Reputation: 5278
You can expand v
and t
:
V <- matrix(rep.int(v[,1],dim(t)[[1]]),ncol=dim(t)[[1]])
TT <- matrix(rep.int(t[,1],dim(v)[[1]]),ncol=dim(t)[[1]],byrow=T)
and then subtract and take the maximum value of each column:
max.col(-abs(V-TT))
Upvotes: 2
Reputation: 42629
Not really vectorized despite of the name, as Vectorize
calls lapply
. But this gives the result:
> Vectorize(function(r) which.min(abs(v[r,1] - t[,1])))(seq(nrow(v)))
## [1] 4 3 3 2 5 5 2 5 2 5 3 3 2 5 1 4 5 5 4 3 3 5 5 2 4 2 2 4 4 3 2 4 5 2
## [35] 2 3 2 4 4 1 5 5 2 3 2 4 5 5 3 5 2 4 4 2 4 5 5 5 5 5 4 3 3 5 5 3 2 3
## [69] 5 3 5 3 3 5 4 5 5 3 1 2 5 5 2 3 3 4 3 3 4 5 4 2 2 3 4 2 5 5 5 5 2
This value can then be assigned to v[,2
].
Upvotes: 2