Reputation: 429
Imagine a small dataset of xy coordinates. These points are grouped by a variable called indexR, there are 3 groups in total. All xy coordinates are in the same units. The data looks approximately like so:
# A tibble: 61 x 3
indexR x y
<dbl> <dbl> <dbl>
1 1 837 924
2 1 464 661
3 1 838 132
4 1 245 882
5 1 1161 604
6 1 1185 504
7 1 853 870
8 1 1048 859
9 1 1044 514
10 1 141 938
# ... with 51 more rows
The goal is to determine which 3 points, one from each group, are closest to each other, in the sense of minimizing the sum of the pairwise distances between selected points.
I have attempted this by considering euclidian distances, as follows. (Credit goes to @Mouad_S, in this thread, and https://gis.stackexchange.com/questions/233373/distance-between-coordinates-in-r)
#dput provided at bottom of this post
> df$dummy = 1
> df %>%
+ full_join(df, c("dummy" = "dummy")) %>%
+ full_join(df, c("dummy" = "dummy")) %>%
+ filter(indexR.x != indexR.y & indexR.x != indexR & indexR.y != indexR) %>%
+ mutate(dist =
+ ((.$x - .$x.x)^2 + (.$y- .$y.x)^2)^.5 +
+ ((.$x - .$x.y)^2 + (.$y- .$y.y)^2)^.5 +
+ ((.$x.x - .$x.y)^2 + (.$y.x- .$y.y)^2)^.5,
+ dist = round(dist, digits = 0)) %>%
+ arrange(dist) %>%
+ filter(dist == min(dist))
# A tibble: 6 x 11
indexR.x x.x y.x dummy indexR.y x.y y.y indexR x y dist
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 638 324 1 2 592 250 3 442 513 664
2 1 638 324 1 3 442 513 2 592 250 664
3 2 592 250 1 1 638 324 3 442 513 664
4 2 592 250 1 3 442 513 1 638 324 664
5 3 442 513 1 1 638 324 2 592 250 664
6 3 442 513 1 2 592 250 1 638 324 664
From this we can identify the three points closest together (minimum distance apart; enlarged on the figure below). However, the challenge comes when extending this such that indexR has 4,5 ... n groups. The problem is in finding a more practical or optimised method for making this calculation.
structure(list(indexR = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3), x = c(836.65, 464.43, 838.12, 244.68, 1160.86,
1184.52, 853.4, 1047.96, 1044.2, 141.06, 561.01, 1110.74, 123.4,
1087.24, 827.83, 100.86, 140.07, 306.5, 267.83, 1118.61, 155.04,
299.52, 543.5, 782.25, 737.1, 1132.14, 659.48, 871.78, 1035.33,
867.81, 192.94, 1167.8, 1099.59, 1097.3, 1089.78, 1166.59, 703.33,
671.64, 346.49, 440.89, 126.38, 638.24, 972.32, 1066.8, 775.68,
591.86, 818.75, 953.63, 1104.98, 1050.47, 722.43, 1022.17, 986.38,
1133.01, 914.27, 725.15, 1151.52, 786.08, 1024.83, 246.52, 441.53
), y = c(923.68, 660.97, 131.61, 882.23, 604.09, 504.05, 870.35,
858.51, 513.5, 937.7, 838.47, 482.69, 473.48, 171.78, 774.99,
792.46, 251.26, 757.95, 317.71, 401.93, 326.32, 725.89, 98.43,
414.01, 510.16, 973.61, 445.33, 504.54, 669.87, 598.75, 225.27,
789.45, 135.31, 935.51, 270.38, 241.19, 595.05, 401.25, 160.98,
778.86, 192.17, 323.76, 361.08, 444.92, 354, 249.57, 301.64,
375.75, 440.03, 428.79, 276.5, 408.84, 381.14, 459.14, 370.26,
304.05, 439.14, 339.91, 435.85, 759.42, 513.37)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -61L), .Names = c("indexR",
"x", "y"))
Upvotes: 2
Views: 5422
Reputation: 103
I developed a simple algorithm to quickly solve this problem. The first step is to overlay a grid on the entire area of points. The first step is to assign each point from each group to the cell or unit square where it is located. Next we go to the lower left corner of the graph and go over one cell and up one cell. This is the starting cell. Then we define a region of interest consisting of this cell and all of its 8 neighbors. Then a test is made to determine whether or not at least one point from each of the groups is within this 9 cell region. If so then the distance from each point represented in this region from each of the groups of points to all other points from all other groups is calculated. In other words all combinations of points in this 9-cell region are used to get a total distance where paired points for distance calculation are never from the same group. From these calculations the one with the minimum distance involving a single point from each group is saved as a possible solution. Then this entire process is repeated by going over one cell to the right. Each 9-cell region is calculated as the central cell moves on to the right. This is stopped one cell from the right end. When the first row is completed the process proceeds by going up one row and starting again at the left but one cell over again. Thus the each cell has been considered when the top row is finished. The solution will be the minimum distance computed from all the tests done for each 9-cell region.
The reason we consider a 9-cell region and not just go cell-by-cell is that we could miss closely spaced points from different groups that are located in the corners of cells.
It's important to choose the correct cell or grid size. If the cells are too small then no possible solution will be found because none of the regions will encompass at least one point from each group. If the cells are too large then there will be many points from each group and calculation time will be excessive. Fortunately this optimal cell size can be quickly found through trial and error.
I've run this algorithm multiple times with varying number of groups and number of points in a group. For randomly scattered points in all groups I found that a 15 x 15 grid size works well for a 10 group - 400 point (40 points per group) case. That example runs in under one second.
Upvotes: 0
Reputation: 77454
You cannot afford to enumerate all possible solutions, and I don't see any obvious shortcut.
So I guess you'll have to do a branch and bound optimization approach.
First guess a reasonably good solution. Like the closest two points with different labels. Then add the nearest with a different label until you have all labels covered.
Now do some trivial optimization: for every label, try if there is some point that you can use instead of the current point to improve the result. Stop when you can't find any further improvement.
For this initial guess, compute the distances. This will give you an upper bound, which allows you to stop your search early. You can also compute a lower bound, the sum of all best two-label solutions.
Now you can try to remove points, where the nearest neighbors of each label + the lower bounds for all other labels is already worse than your initial solution. This will hopefully eliminate a lot of points.
Then you can start enumerating solutions (probably begin with the smallest labels first), but stop recursion whenever the current solution + the remaining lower bounds are larger than your best known solution (branch and bound).
You can also try sorting points e.g. by minimum distance to the remaining labels, to hopefully find better bounds fast.
I'd certainly not choose R to implement this...
Upvotes: 1
Reputation: 44299
One possibility would be to formulate the problem of identifying the closest elements, one from each group, as a mixed integer program. We could define decision variables y_i for whether each point i is selected, as well as x_{ij} for whether points i and j are both selected (x_{ij} = y_iy_j). We need to select one element from each group.
In practice, you could implement this mixed integer program using the lpSolve
package (or one of the other R optimization packages).
opt.closest <- function(df) {
# Compute every pair of indices
library(dplyr)
pairs <- as.data.frame(t(combn(nrow(df), 2))) %>%
mutate(G1=df$indexR[V1], G2=df$indexR[V2]) %>%
filter(G1 != G2) %>%
mutate(dist = sqrt((df$x[V1]-df$x[V2])^2+(df$y[V1]-df$y[V2])^2))
# Compute a few convenience values
n <- nrow(df)
nP <- nrow(pairs)
groups <- sort(unique(df$indexR))
nG <- length(groups)
gpairs <- combn(groups, 2)
nGP <- ncol(gpairs)
# Solve the optimization problem
obj <- c(pairs$dist, rep(0, n))
constr <- rbind(cbind(diag(nP), -outer(pairs$V1, seq_len(n), "==")),
cbind(diag(nP), -outer(pairs$V2, seq_len(n), "==")),
cbind(diag(nP), -outer(pairs$V1, seq_len(n), "==") - outer(pairs$V2, seq_len(n), "==")),
cbind(matrix(0, nG, nP), outer(groups, df$indexR, "==")),
cbind((outer(gpairs[1,], pairs$G1, "==") &
outer(gpairs[2,], pairs$G2, "==")) |
(outer(gpairs[2,], pairs$G1, "==") &
outer(gpairs[1,], pairs$G2, "==")), matrix(0, nGP, n)))
dir <- rep(c("<=", ">=", "="), c(2*nP, nP, nG+nGP))
rhs <- rep(c(0, -1, 1), c(2*nP, nP, nG+nGP))
library(lpSolve)
mod <- lp("min", obj, constr, dir, rhs, all.bin=TRUE)
which(tail(mod$solution, n) == 1)
}
This can compute the closest 3 points, one from each cluster, in your example dataset:
df[opt.closest(df),]
# A tibble: 3 x 3
# indexR x y
# <dbl> <dbl> <dbl>
# 1 1 638.24 323.76
# 2 2 591.86 249.57
# 3 3 441.53 513.37
It can also compute the best possible solution for datasets with more points and groups. Here are the runtimes for datasets with 7 groups each and 100 and 200 points:
make.dataset <- function(n, nG) {
set.seed(144)
data.frame(indexR = sample(seq_len(nG), n, replace=T), x = rnorm(n), y=rnorm(n))
}
df100 <- make.dataset(100, 7)
system.time(opt.closest(df100))
# user system elapsed
# 11.536 2.656 15.407
df200 <- make.dataset(200, 7)
system.time(opt.closest(df200))
# user system elapsed
# 187.363 86.454 323.167
This is far from instantaneous -- it takes 15 seconds for the 100-point, 7-group dataset and 323 seconds for the 200-point, 7-group dataset. Still, it is much quicker than iterating through all 92 million 7-tuples in the 100-point dataset or all 13.8 billion 7-tuples in the 200-point dataset. You could set a runtime limit with a solver like the one from the Rglpk package to get the best solution obtained within that limit.
Upvotes: 1
Reputation: 2716
you can use cross joins to have all the points combinations, calculate the total distance between all three points, then take the minimum of that.
df$id <- row.names(df) # to create ID's for the points
df2 <- merge(df, df, by = NULL ) # the first cross join
df3 <- merge(df2, df, by = NULL) # the second cross join
# eliminating rows where the points are of the same indexR
df3 <- df3[df3$indexR.x != df3$indexR.y & df3$indexR.x != df3$indexR
& df3$indexR.y != df3$indexR,]
## calculating the total distance
df3$total_distance <- ((df3$x - df3$x.x)^2 + (df3$y- df3$y.x)^2)^.5 +
((df3$x - df3$x.y)^2 + (df3$y- df3$y.y)^2)^.5 +
((df3$x.x - df3$x.y)^2 + (df3$y.x- df3$y.y)^2)^.5
## minimum distance
df3[which.min(df3$total_distance),]
indexR.x x.x y.x id.x indexR.y x.y y.y id.y indexR x y id total_distance
155367 3 441.53 513.37 61 2 591.86 249.57 46 1 638.24 323.76 42 664.3373
Upvotes: 0