Reputation: 8117
I have a vector
set.seed(2)
x <- sample.int(20, 5)
[1] 4 14 11 3 16
Now, for every element I want to find
the element with the minimum distance (min(abs(x[i]-x[-i]))
for element i
), which here would be
[1] 3 16 14 4 14
the (first) index of the element with the minimum distance, which here would be
[1] 4 5 2 1 2
The point is that the element itself is not considered, but only all the other elements, which is why this R - Fastest way to find nearest value in vector is not the answer.
If the actual answer is out there, sorry - I didn't find it.
Upvotes: 4
Views: 531
Reputation: 269481
1) Rfast Using dista
in Rfast we get the indexes of the closest two. Take the second closest as the closest will be the same value.
library(Rfast)
x <- c(4, 14, 11, 3, 16) # input
x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
## [1] 3 16 14 4 14
2) sqldf Using SQL we can left join DF to itself excluding the same value value and take the row with the minimum distance.
DF <- data.frame(x) # x is from (1)
sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
from DF a
left join DF b on a.x != b.x
group by a.rowid")[1:2]
giving:
x nearest
1 4 3
2 14 16
3 11 14
4 3 4
5 16 14
3) zoo Sort the input, take the value corresponding to the least difference on either of side of each element and order it back.
library(zoo)
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
## [1] 3 16 14 4 14
4) Base R Using ix
and least
from (3) we can mimic (3) using only base functions as follows.
apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
## [1] 3 16 14 4 14
4a) This slightly shorter variation would also work:
-apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
## [1] 3 16 14 4 14
4b) Simplifying further we have the following base solution where, again, ix
is from (3):
xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
## [1] 3 16 14 4 14
The example in the question had no duplicates but if there were duplicates there is some question regarding the problem definition. For example if we had c(1, 3, 4, 1)
then if we look at the first value, 1, there is another value exactly equal to it so the closest value is 1. Another interpretation is that the closest value not equal to 1 should be returned which in this case is 3. In the codes above the sqldf
solution gives the closest value not equal to the current value whereas the others give the closest value among the remaining values.
If we wanted the interpretation of the closest not equal for those other than sqldf
then we could use rle
after ordering to compress it down to unique values and then use inverse.rle
afterwards as shown on the modified 4b:
x <- c(1, 3, 4, 1)
ix <- order(x)
r <- rle(x[ix])
xx <- r$values
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
r$values <- ifelse(xx - x1 < x2 - xx, x1, x2)
inverse.rle(r)[order(ix)]
## [1] 3 4 3 3
Upvotes: 8
Reputation: 502
I was very interested in this question and in the approaches suggested in the other responses, so I compared them with regard to their running time (and I added another approach using the package RANN
). The code is appended below. TL;DR: The base R version 4b by user G. Grothendieck was most efficient, and by a significant margin.
library(RANN)
library(zoo)
library(data.table)
library(Rfast)
library(sqldf)
# All functions take a vector as argument,
# and return the values of nearest neighbours (not their index)
# Using base R, by ThomasIsCoding
base_nn <- function(x) {
d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
id <- unlist(Map(which.min,d))
x[id]
}
# Using Rfast, by G. Grothendieck
rfast_nn <- function(x) {
x[ dista(x, x, k = 2, index = TRUE)[, 2] ]
}
# Using sqldf, by G. Grothendieck
sqldf_nn <- function(x) {
DF <- data.frame(x) # x is from (1)
unname(
unlist(sqldf("select a.x, b.x nearest, min(abs(a.x - b.x))
from DF a
left join DF b on a.x != b.x
group by a.rowid")[2])
)
}
# Using `zoo`, by G. Grothendieck
zoo_nn <- function(x) {
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
rollapply(c(-Inf, x[ix], Inf), 3, least)[order(ix)]
}
# Using base R (v 4), by G. Grothendieck
base2_nn <- function(x) {
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
apply(embed(c(-Inf, x[ix], Inf), 3)[, 3:1], 1, least)[order(ix)]
}
# Using base R (v 4a), by G. Grothendieck
base3_nn <- function(x) {
ix <- order(x)
least <- function(x) if (x[2] - x[1] < x[3] - x[2]) x[1] else x[3]
-apply(embed(-c(-Inf, x[ix], Inf), 3), 1, least)[order(ix)]
}
# Using base R (v 4b), by G. Grothendieck
base4_nn <- function(x) {
ix <- order(x)
xx <- x[ix]
x1 <- c(-Inf, xx[-length(xx)])
x2 <- c(xx[-1], Inf)
ifelse(xx - x1 < x2 - xx, x1, x2)[order(ix)]
}
# Using data.table, by IceCreamToucan
dt_nn <- function(x) {
dt <- setkey(data.table(x), x)
dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
unname(unlist(dt[, .(closex)]))
}
# Using, RANN, by me
rann_nn <- function(x) {
id <- RANN::nn2(as.matrix(x), k = 2)$nn.idx[, 2]
x[id]
}
### Apply all methods
# Test that all have the same output:
x <- c(4, 14,11,3,16)
rann_nn(x)
# [1] 3 16 14 4 14
base_nn(x)
# [1] 3 16 14 4 14
rfast_nn(x)
# [1] 3 16 14 4 14
sqldf_nn(x)
# [1] 3 16 14 4 14
zoo_nn(x)
# [1] 3 16 14 4 14
base2_nn(x)
# [1] 3 16 14 4 14
base3_nn(x)
# [1] 3 16 14 4 14
base4_nn(x)
# [1] 3 16 14 4 14
dt_nn(x) # differently ordered for some reason
# [1] 4 3 14 16 14
# Compare running times
library(microbenchmark)
# Compare for N = 1000 elements
benchmark_data <- rnorm(1000)
microbenchmark(
rann_nn(benchmark_data),
base_nn(benchmark_data),
rfast_nn(benchmark_data),
sqldf_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 641.180 684.1975 776.5467 711.6680 775.3635 3822.023 100
# base_nn(benchmark_data) 166523.177 179240.8130 209471.1333 187633.0515 249740.8425 330864.712 100
# rfast_nn(benchmark_data) 45160.603 47032.5225 47681.0557 47594.0075 48308.8440 50579.839 100
# sqldf_nn(benchmark_data) 133916.594 138769.8175 143505.9315 140543.3250 143830.2765 211873.960 100
# zoo_nn(benchmark_data) 4359.359 4604.0275 5008.4291 4785.1515 5037.9705 14999.802 100
# base2_nn(benchmark_data) 1292.322 1407.4875 1747.8404 1462.7295 1588.1580 11297.321 100
# base3_nn(benchmark_data) 1263.644 1396.9210 1615.7495 1472.9940 1571.8575 11828.015 100
# base4_nn(benchmark_data) 119.543 146.1080 254.5075 178.1065 197.4265 7726.156 100
# dt_nn(benchmark_data) 5290.337 6580.6965 7111.1816 6892.3800 7351.3795 29469.815 100
# For N = 100000, leaving out the slowest versions (e.g., `base_nn()`
# no longer works because a distance matrix cannot be computed for
# N = 100000)
benchmark_data <- rnorm(100000)
microbenchmark(
rann_nn(benchmark_data),
zoo_nn(benchmark_data),
base2_nn(benchmark_data),
base3_nn(benchmark_data),
base4_nn(benchmark_data),
dt_nn(benchmark_data)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rann_nn(benchmark_data) 130.957025 141.02904 149.94052 148.60184 156.14506 271.1882 100
# zoo_nn(benchmark_data) 606.690004 673.88980 720.12545 717.51658 766.98190 886.4397 100
# base2_nn(benchmark_data) 142.554407 176.30358 198.58375 193.34812 212.33885 329.5470 100
# base3_nn(benchmark_data) 142.074126 168.78195 189.65122 184.45025 205.89414 287.0740 100
# base4_nn(benchmark_data) 9.354764 10.46687 17.22086 12.36354 14.22882 166.4758 100
# dt_nn(benchmark_data) 96.503882 104.06914 117.95408 108.20284 121.11428 247.2092 100
Upvotes: 4
Reputation: 28675
Option with a data.table non-equi-join
dt <- setkey(data.table(x), x)
dt[dt, on = .(x > x), mult = 'first', lowx := i.x][, lowx := fcoalesce(lowx + .0, -Inf)]
dt[dt, on = .(x < x), mult = 'last', highx := i.x][, highx := fcoalesce(highx + .0, Inf)]
dt[, closex := fifelse(x - lowx < highx - x, lowx, highx)]
dt[, .(x, closex)]
# x closex
# 1: 3 4
# 2: 4 3
# 3: 11 14
# 4: 14 16
# 5: 16 14
Upvotes: 3
Reputation: 101189
Here is a base R solution
d <- data.frame(`diag<-`(as.matrix(dist(x)),Inf))
ids <- unlist(Map(which.min,d))
val <- x[ids]
such that
> ids
X1 X2 X3 X4 X5
4 5 2 1 2
> val
[1] 3 16 14 4 14
DATA
x <- c(4, 14,11,3,16)
Upvotes: 3