Quinn
Quinn

Reputation: 429

Find shortest distance between multiple points

Imagine a small dataset of xy coordinates. These points are grouped by a variable called indexR, there are 3 groups in total. All xy coordinates are in the same units. The data looks approximately like so:

# A tibble: 61 x 3
   indexR     x     y
    <dbl> <dbl> <dbl>
 1      1   837   924
 2      1   464   661
 3      1   838   132
 4      1   245   882
 5      1  1161   604
 6      1  1185   504
 7      1   853   870
 8      1  1048   859
 9      1  1044   514
10      1   141   938
# ... with 51 more rows

The goal is to determine which 3 points, one from each group, are closest to each other, in the sense of minimizing the sum of the pairwise distances between selected points.

I have attempted this by considering euclidian distances, as follows. (Credit goes to @Mouad_S, in this thread, and https://gis.stackexchange.com/questions/233373/distance-between-coordinates-in-r)

#dput provided at bottom of this post
> df$dummy = 1
> df %>% 
+   full_join(df, c("dummy" = "dummy")) %>% 
+   full_join(df, c("dummy" = "dummy")) %>%
+   filter(indexR.x != indexR.y & indexR.x != indexR & indexR.y != indexR) %>% 
+   mutate(dist = 
+            ((.$x - .$x.x)^2 + (.$y- .$y.x)^2)^.5 +
+            ((.$x - .$x.y)^2 + (.$y- .$y.y)^2)^.5 +
+            ((.$x.x - .$x.y)^2 + (.$y.x- .$y.y)^2)^.5,
+          dist = round(dist, digits = 0)) %>%
+   arrange(dist) %>%
+   filter(dist == min(dist))
# A tibble: 6 x 11
  indexR.x   x.x   y.x dummy indexR.y   x.y   y.y indexR     x     y  dist
     <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1        1   638   324     1        2   592   250      3   442   513   664
2        1   638   324     1        3   442   513      2   592   250   664
3        2   592   250     1        1   638   324      3   442   513   664
4        2   592   250     1        3   442   513      1   638   324   664
5        3   442   513     1        1   638   324      2   592   250   664
6        3   442   513     1        2   592   250      1   638   324   664

From this we can identify the three points closest together (minimum distance apart; enlarged on the figure below). However, the challenge comes when extending this such that indexR has 4,5 ... n groups. The problem is in finding a more practical or optimised method for making this calculation.

enter image description here

structure(list(indexR = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 3, 3), x = c(836.65, 464.43, 838.12, 244.68, 1160.86, 
1184.52, 853.4, 1047.96, 1044.2, 141.06, 561.01, 1110.74, 123.4, 
1087.24, 827.83, 100.86, 140.07, 306.5, 267.83, 1118.61, 155.04, 
299.52, 543.5, 782.25, 737.1, 1132.14, 659.48, 871.78, 1035.33, 
867.81, 192.94, 1167.8, 1099.59, 1097.3, 1089.78, 1166.59, 703.33, 
671.64, 346.49, 440.89, 126.38, 638.24, 972.32, 1066.8, 775.68, 
591.86, 818.75, 953.63, 1104.98, 1050.47, 722.43, 1022.17, 986.38, 
1133.01, 914.27, 725.15, 1151.52, 786.08, 1024.83, 246.52, 441.53
), y = c(923.68, 660.97, 131.61, 882.23, 604.09, 504.05, 870.35, 
858.51, 513.5, 937.7, 838.47, 482.69, 473.48, 171.78, 774.99, 
792.46, 251.26, 757.95, 317.71, 401.93, 326.32, 725.89, 98.43, 
414.01, 510.16, 973.61, 445.33, 504.54, 669.87, 598.75, 225.27, 
789.45, 135.31, 935.51, 270.38, 241.19, 595.05, 401.25, 160.98, 
778.86, 192.17, 323.76, 361.08, 444.92, 354, 249.57, 301.64, 
375.75, 440.03, 428.79, 276.5, 408.84, 381.14, 459.14, 370.26, 
304.05, 439.14, 339.91, 435.85, 759.42, 513.37)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -61L), .Names = c("indexR", 
"x", "y"))

Upvotes: 2

Views: 5422

Answers (4)

Bob Bixler
Bob Bixler

Reputation: 103

I developed a simple algorithm to quickly solve this problem. The first step is to overlay a grid on the entire area of points. The first step is to assign each point from each group to the cell or unit square where it is located. Next we go to the lower left corner of the graph and go over one cell and up one cell. This is the starting cell. Then we define a region of interest consisting of this cell and all of its 8 neighbors. Then a test is made to determine whether or not at least one point from each of the groups is within this 9 cell region. If so then the distance from each point represented in this region from each of the groups of points to all other points from all other groups is calculated. In other words all combinations of points in this 9-cell region are used to get a total distance where paired points for distance calculation are never from the same group. From these calculations the one with the minimum distance involving a single point from each group is saved as a possible solution. Then this entire process is repeated by going over one cell to the right. Each 9-cell region is calculated as the central cell moves on to the right. This is stopped one cell from the right end. When the first row is completed the process proceeds by going up one row and starting again at the left but one cell over again. Thus the each cell has been considered when the top row is finished. The solution will be the minimum distance computed from all the tests done for each 9-cell region.

The reason we consider a 9-cell region and not just go cell-by-cell is that we could miss closely spaced points from different groups that are located in the corners of cells.

It's important to choose the correct cell or grid size. If the cells are too small then no possible solution will be found because none of the regions will encompass at least one point from each group. If the cells are too large then there will be many points from each group and calculation time will be excessive. Fortunately this optimal cell size can be quickly found through trial and error.

I've run this algorithm multiple times with varying number of groups and number of points in a group. For randomly scattered points in all groups I found that a 15 x 15 grid size works well for a 10 group - 400 point (40 points per group) case. That example runs in under one second.

Upvotes: 0

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77454

You cannot afford to enumerate all possible solutions, and I don't see any obvious shortcut.

So I guess you'll have to do a branch and bound optimization approach.

First guess a reasonably good solution. Like the closest two points with different labels. Then add the nearest with a different label until you have all labels covered.

Now do some trivial optimization: for every label, try if there is some point that you can use instead of the current point to improve the result. Stop when you can't find any further improvement.

For this initial guess, compute the distances. This will give you an upper bound, which allows you to stop your search early. You can also compute a lower bound, the sum of all best two-label solutions.

Now you can try to remove points, where the nearest neighbors of each label + the lower bounds for all other labels is already worse than your initial solution. This will hopefully eliminate a lot of points.

Then you can start enumerating solutions (probably begin with the smallest labels first), but stop recursion whenever the current solution + the remaining lower bounds are larger than your best known solution (branch and bound).

You can also try sorting points e.g. by minimum distance to the remaining labels, to hopefully find better bounds fast.

I'd certainly not choose R to implement this...

Upvotes: 1

josliber
josliber

Reputation: 44299

One possibility would be to formulate the problem of identifying the closest elements, one from each group, as a mixed integer program. We could define decision variables y_i for whether each point i is selected, as well as x_{ij} for whether points i and j are both selected (x_{ij} = y_iy_j). We need to select one element from each group.

In practice, you could implement this mixed integer program using the lpSolve package (or one of the other R optimization packages).

opt.closest <- function(df) {
  # Compute every pair of indices
  library(dplyr)
  pairs <- as.data.frame(t(combn(nrow(df), 2))) %>%
    mutate(G1=df$indexR[V1], G2=df$indexR[V2]) %>%
    filter(G1 != G2) %>%
    mutate(dist = sqrt((df$x[V1]-df$x[V2])^2+(df$y[V1]-df$y[V2])^2))

  # Compute a few convenience values
  n <- nrow(df)
  nP <- nrow(pairs)
  groups <- sort(unique(df$indexR))
  nG <- length(groups)
  gpairs <- combn(groups, 2)
  nGP <- ncol(gpairs)

  # Solve the optimization problem
  obj <- c(pairs$dist, rep(0, n))
  constr <- rbind(cbind(diag(nP), -outer(pairs$V1, seq_len(n), "==")),
                  cbind(diag(nP), -outer(pairs$V2, seq_len(n), "==")),
                  cbind(diag(nP), -outer(pairs$V1, seq_len(n), "==") - outer(pairs$V2, seq_len(n), "==")),
                  cbind(matrix(0, nG, nP), outer(groups, df$indexR, "==")),
                  cbind((outer(gpairs[1,], pairs$G1, "==") &
                         outer(gpairs[2,], pairs$G2, "==")) |
                        (outer(gpairs[2,], pairs$G1, "==") &
                         outer(gpairs[1,], pairs$G2, "==")), matrix(0, nGP, n)))
  dir <- rep(c("<=", ">=", "="), c(2*nP, nP, nG+nGP))
  rhs <- rep(c(0, -1, 1), c(2*nP, nP, nG+nGP))
  library(lpSolve)
  mod <- lp("min", obj, constr, dir, rhs, all.bin=TRUE)
  which(tail(mod$solution, n) == 1)
}

This can compute the closest 3 points, one from each cluster, in your example dataset:

df[opt.closest(df),]
# A tibble: 3 x 3
#   indexR      x      y
#    <dbl>  <dbl>  <dbl>
# 1      1 638.24 323.76
# 2      2 591.86 249.57
# 3      3 441.53 513.37

It can also compute the best possible solution for datasets with more points and groups. Here are the runtimes for datasets with 7 groups each and 100 and 200 points:

make.dataset <- function(n, nG) {
  set.seed(144)
  data.frame(indexR = sample(seq_len(nG), n, replace=T), x = rnorm(n), y=rnorm(n))
}
df100 <- make.dataset(100, 7)
system.time(opt.closest(df100))
#    user  system elapsed 
#  11.536   2.656  15.407 
df200 <- make.dataset(200, 7)
system.time(opt.closest(df200))
#    user  system elapsed 
# 187.363  86.454 323.167 

This is far from instantaneous -- it takes 15 seconds for the 100-point, 7-group dataset and 323 seconds for the 200-point, 7-group dataset. Still, it is much quicker than iterating through all 92 million 7-tuples in the 100-point dataset or all 13.8 billion 7-tuples in the 200-point dataset. You could set a runtime limit with a solver like the one from the Rglpk package to get the best solution obtained within that limit.

Upvotes: 1

Mouad_Seridi
Mouad_Seridi

Reputation: 2716

you can use cross joins to have all the points combinations, calculate the total distance between all three points, then take the minimum of that.

df$id <- row.names(df) # to create ID's for the points 

df2 <- merge(df, df, by = NULL ) # the first cross join 

df3 <- merge(df2, df, by = NULL)  # the second cross join 



#  eliminating rows where the points are of the same indexR

df3 <- df3[df3$indexR.x != df3$indexR.y & df3$indexR.x != df3$indexR 
           & df3$indexR.y != df3$indexR,]


## calculating the total distance 

df3$total_distance <- ((df3$x - df3$x.x)^2 + (df3$y- df3$y.x)^2)^.5 +
  ((df3$x - df3$x.y)^2 + (df3$y- df3$y.y)^2)^.5 +
  ((df3$x.x - df3$x.y)^2 + (df3$y.x- df3$y.y)^2)^.5

## minimum distance 

df3[which.min(df3$total_distance),]

indexR.x    x.x    y.x id.x indexR.y    x.y    y.y id.y indexR      x      y id total_distance
155367        3 441.53 513.37   61        2 591.86 249.57   46      1 638.24 323.76 42       664.3373

Upvotes: 0

Related Questions