Reputation: 85
I want to do the following in R: for each element in vector X, I want to nearest neighbor in vector Y, such that the sum of the absolute differences between each X-Y match is minimized. Vector Y is at least as long as vector X.
The catch is: I want to do this WITHOUT replacement. For example, given:
X= c(3, 6)
Y= c(1, 2, 4, 10),
I would like to obtain Z= c(2, 4)
because matching 3 to 2, and 6 to 4, creates a smaller total distance than matching 3 to 4, and 6 to 10.
*This is my first stack question, so apologies in advance for any mistakes I have made in asking the question.
Update: To use @merv's more illustrative example and terminology, I am looking for the global optimum of matches, and NOT the local optimum (first/greedy matches). For example, if X= c(3,7)
and Y= c(1,4,12)
, I want to obtain Z= c(1, 4)
, which has a Manhattan distance of 5. I do NOT want the first/greedy match, which would be Z= c(4, 12)
--this would be obtained by finding the closest match for 3, and subsequently, the closest match for 7.
Upvotes: 4
Views: 1837
Reputation: 8890
As noted by Amol, this is exactly what the hungarian algorithm aims to do: find optimal pairs while minimizing the global cost. All you need to do is to specify a cost matrix, which I take here as the L1/L2 distance between the points.
Replicating OP's second example,using RcppHungarian
, one gets the same solution Z= c(1, 4)
and the same minimal cost of 5
:
library(RcppHungarian)
X= c(3,7)
Y= c(1,4,12)
D <- outer(X, Y, function(x, y) abs(x-y))
out <- HungarianSolver(D)
out
#> $cost
#> [1] 5
#>
#> $pairs
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
Y[out$pairs[,2]]
#> [1] 1 4
Created on 2021-11-23 by the reprex package (v2.0.1)
Upvotes: 1
Reputation: 21
This is an optimization problem. What you need is to use hungarian algorithm, it does exactly what you looking for.
Upvotes: 1
Reputation: 76820
If you can assume that most inputs to this are going to be small in size, then the simplest approach is to expand all possible combinations of the search space.
uniqueNearestNeighbor <- function (X, Y) {
zs <- combn(Y, length(X))
dists <- apply(zs, 2, function (z) sum(abs(X - z)))
return(zs[,which.min(dists)])
}
Note that this assumes your vectors are both sorted.
> uniqueNearestNeighbor(c(3, 7), c(1, 4, 12))
[1] 1 4
If you have a large search space (Y
), but a low dimensional input (X
), you could prune the search space to help limit the number of combinations. For example, you can safely exclude all points in Y
that aren't at least a k-th nearest neighbor of a point in X
, where k is the dimension of X
.
If you do have a large search space and pruning isn't enough to slim down the problem, or if you will be repeatedly computing this and it becomes a clear bottleneck, you'll want to resort to a more sophisticated approach. Off the top of my head, I think the A* algorithm seems like it is an appropriate fit to the problem. For an admissible heuristic function, one can use the sum of the distances of each point in X
to its nearest neighbor in Y
. At each iteration, assign one point in X
to its nearest neighbor, then proceed down the tree with that point and its assignee removed. If a given x
in X
has multiple nearest neighbors (e.g., x = 2
and Y
contains 1 and 3), one must include both options in the search space.
This will arrive at a global optimum, due to the provable property that given any X
and Y
, for all global optima, at least one x
in X
gets assigned to its nearest neighbor in Y
. Hence, the described tree will contain all possible global optima, and since A* is a breadth-first search, one of these is guaranteed to be found.
If you need to go this route, it might also be worth asking on cs.stackexchange.com, since there may be more appropriate algorithms.
Upvotes: 2