MikeHuber

Reputation: 665

Fastest way to find nearest value in vector

I have two integer/POSIXct vectors:

a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements

Now my resulting vector c should contain for each element of vector a the nearest element of b:

c <- c(4,4,4,4,4,6,6,...)

I tried it with apply and which.min(abs(a - b)), but it is very slow.

Is there a cleverer way to solve this? Is there a data.table solution?

Upvotes: 33

Views: 50996

Answers (7)

qfazille

Reputation: 1671

# Function
Closest <- function(x, bands) {
  sapply(x, function(y) {
    bands[which.min(abs(bands - y))]
  })
}

# Be aware that when a value lies exactly between two "bands", the one that comes first in bands is returned,
# so the two calls below do not return the same result
Closest(x = c(0, 25000, 25001, 24999, 53000, 159000), bands = c(0, 50000, 100000))
Closest(x = c(0, 25000, 25001, 24999, 53000, 159000), bands = c(100000, 50000, 0))

Upvotes: 0

Mehrad

Reputation: 3829

As presented in this link, you can do either:

which(abs(x - your.number) == min(abs(x - your.number)))

or

which.min(abs(x - your.number))

where x is your vector and your.number is the value to look up. If you have a matrix or data.frame, convert it to a numeric vector first and then apply this to the resulting vector.

For example:

x <- 1:100
your.number <- 21.5
which(abs(x - your.number) == min(abs(x - your.number)))

would output:

[1] 21 22

Update: Based on the very kind comment of hendy I have added the following to make it more clear:

Note that the results above (i.e. 21 and 22) are the indexes of the items (this is how which() works in R), so to get the actual values you have to use these indexes to subset x. Here is another example:

x <- seq(from = 100, to = 10, by = -5)
x
[1] 100  95  90  85  80  75  70  65  60  55  50  45  40  35  30  25  20  15  10

Now let's find the number closest to 42:

your.number <- 42
target.index <- which(abs(x - your.number) == min(abs(x - your.number)))
x[target.index]

which would output the "value" we are looking for from the x vector:

[1] 40
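The two forms differ when there is a tie. A minimal sketch, reusing the x <- 1:100 example from above (the names d, all_ties and first_only are just illustrative): the == min(...) form returns every index at the minimum distance, while which.min returns only the first one.

```r
x <- 1:100
your.number <- 21.5

# Distance of every element of x from the target value
d <- abs(x - your.number)

all_ties   <- which(d == min(d))  # every index at the minimum distance
first_only <- which.min(d)        # only the first such index

x[all_ties]
x[first_only]
```

Which one you want depends on whether a tie should give you both neighbours or a single value.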

Upvotes: 53

ThomasIsCoding

Reputation: 101189

Here is a simple base R option, using max.col + outer:

b[max.col(-abs(outer(a,b,"-")))]

which gives

> b[max.col(-abs(outer(a,b,"-")))]
 [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16
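One caveat worth checking against ?max.col: its default ties.method is "random", so values of a that sit exactly halfway between two elements of b (5, 8 and 13 here) can come out differently from run to run. A minimal sketch, assuming you want deterministic tie-breaking, passes ties.method explicitly (the names dist_neg, nearest_first and nearest_last are just illustrative):

```r
a <- 1:15
b <- c(4, 6, 10, 16)

# Negated distance matrix: one row per element of a, one column per element of b
dist_neg <- -abs(outer(a, b, "-"))

# On a tie, take the first (lower) or the last (higher) candidate in b
nearest_first <- b[max.col(dist_neg, ties.method = "first")]
nearest_last  <- b[max.col(dist_neg, ties.method = "last")]

nearest_first
nearest_last
```

Keep in mind that outer builds a length(a) x length(b) matrix, so this approach only suits inputs far smaller than the sizes mentioned in the question.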

Upvotes: 3

Rohit Mishra

Reputation: 571

library(data.table)

a=data.table(Value=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))

a[,merge:=Value]

b=data.table(Value=c(4,6,10,16))

b[,merge:=Value]

setkeyv(a,c('merge'))

setkeyv(b,c('merge'))

Merge_a_b=a[b,roll='nearest']

When joining two data.tables, roll = 'nearest' matches each row of the table inside the brackets (here b) to the row of the outer table (here a) whose key is nearest. The result therefore has as many rows as b. As usual, both tables need a common key for the join.
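The question asks for the other direction: one result per element of a, holding the nearest element of b. A minimal sketch of that, assuming the same rolling join with the two tables swapped (the column names a_value, b_value and join_val are made up for this example):

```r
library(data.table)

a <- data.table(a_value = as.numeric(1:15))
b <- data.table(b_value = c(4, 6, 10, 16))

# Copy the values into a shared join column so the originals survive the join
a[, join_val := a_value]
b[, join_val := b_value]

# For each row of a (inside the brackets), roll to the nearest join_val in b
res <- b[a, on = "join_val", roll = "nearest"]

# b_value now holds the nearest element of b for each element of a
res$b_value
```

How exact ties (5, 8 and 13 here) are resolved is up to data.table, so check the result if that matters for your data.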

Upvotes: 9

Katarzyna Paczkowska

Reputation: 69

For those who would be satisfied with the slow solution:

sapply(a, function(a, b) {b[which.min(abs(a-b))]}, b)

Upvotes: 6

morgan121

Reputation: 2253

Late to the party, but there is now a function in the DescTools package called Closest which does almost exactly what you want (it just doesn't handle multiple lookup values at once).

To get around this, we can lapply over your vector a and find the closest value for each element.

library(DescTools)

lapply(a, function(i) Closest(x = b, a = i))

You might notice that more values are being returned than exist in a. This is because Closest will return both values if the value you are testing is exactly between two (e.g. 3 is exactly between 1 and 5, so both 1 and 5 would be returned).

To get around this, put either min or max around the result:

lapply(a, function(i) min(Closest(x = b, a = i)))
lapply(a, function(i) max(Closest(x = b, a = i)))

Then unlist the result to get a plain vector :)

Upvotes: 1

asachet

Reputation: 6921

Not quite sure how it will behave with your volume but cut is quite fast.

The idea is to cut your vector a at the midpoints between the elements of b.

Note that I am assuming the elements in b are strictly increasing!

Something like this:

a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements

cuts <- c(-Inf, b[-1]-diff(b)/2, Inf)
# Will yield: c(-Inf, 5, 8, 13, Inf)

cut(a, breaks=cuts, labels=b)
# [1] 4  4  4  4  4  6  6  6  10 10 10 10 10 16 16
# Levels: 4 6 10 16

This is even faster using a lower-level function like findInterval (which, again, assumes that breakpoints are non-decreasing).

findInterval(a, cuts)
# [1] 1 1 1 1 2 2 2 3 3 3 3 3 4 4 4

So of course you can do something like:

index = findInterval(a, cuts)
b[index]
# [1]  4  4  4  4  6  6  6 10 10 10 10 10 16 16 16

Note that you can choose what happens to elements of a that are equidistant between two elements of b by passing the relevant arguments to cut (or findInterval); see their help pages.
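As a rough sketch of how this could be wrapped up, here is a small helper (nearest_in and its ties_low argument are hypothetical names, not part of base R). It sorts b first, so the "strictly increasing" assumption no longer has to hold for the input, and it uses findInterval's left.open argument to decide which way exact midpoints go.

```r
# Hypothetical helper: for each element of a, return the nearest element of b
nearest_in <- function(a, b, ties_low = FALSE) {
  b <- sort(unique(b))                       # findInterval needs sorted breakpoints
  cuts <- c(-Inf, b[-1] - diff(b) / 2, Inf)  # midpoints between neighbouring elements of b
  # left.open = FALSE: intervals are [lo, hi), so a midpoint goes to the higher neighbour
  # left.open = TRUE:  intervals are (lo, hi], so a midpoint goes to the lower neighbour
  # all.inside = TRUE keeps every index within 1..length(b)
  idx <- findInterval(a, cuts, left.open = ties_low, all.inside = TRUE)
  b[idx]
}

a <- 1:15
b <- c(16, 4, 10, 6)  # deliberately unsorted

nearest_in(a, b)                   # ties (5, 8, 13) go up
nearest_in(a, b, ties_low = TRUE)  # ties (5, 8, 13) go down
```

Because findInterval does a binary search over the sorted breakpoints, this avoids building the full distance matrix, which is what should make it practical at the sizes in the question.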

Upvotes: 13
