s__
s__

Reputation: 9485

R, compute the smallest Euclidean Distance for two dataset, and label it automatically

I'm working with Euclidean Distance with a pair of dataset. First of all, my data.

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

My goal is to find, for each point in points, the smallest distance from all the center in centers, and append the center name to the points dataset (clearly the smallest one's), and make this procedure automatic.

So I started with the base:

#Euclidean distance
sqrt(sum((x-y)^2))

The fact that I have in my mind how it should work, but I cannot manage how to make it automatic.

  1. choose one row of points, and all the rows of centers
  2. calculate the Euclidean Distance between the row and each row of centers
  3. choose the smallest distance
  4. attach the label of the smallest distance
  5. repeat for the second row ... till the end of points

So I managed to do it manually, to have all the steps to make it automatic:

# 1.  
x = (points[1,2:3])   # select the first of points
y1 = (centers[1,1:2]) # select the first center
y2 = (centers[2,1:2]) # select the second center
y3 = (centers[3,1:2]) # select the third center
y4 = (centers[4,1:2]) # select the fourth center

# 2.
# then the distances
distances <- data.frame(distance = c(
                                    sqrt(sum((x-y1)^2)),
                                    sqrt(sum((x-y2)^2)),
                                    sqrt(sum((x-y3)^2)),
                                    sqrt(sum((x-y4)^2))),
                                    center = centers$center
                                    )

# 3.
# then I choose the row with the smallest distance
d <- distances[which(distances$distance==min(distances$distance)),]

# 4.
# last, I put the label near the point
cbind(points[1,],d)

# 5. 
# then I restart for the second point

The problem is that I cannot manage it automatically. have you got any idea to make this procedure automatic for each points of points? Furthermore, am I reinventing the wheel, i.e. does it exist a faster procedure (as a function) that I don't know?

Upvotes: 3

Views: 1079

Answers (2)

WaltS
WaltS

Reputation: 5530

With the dplyr package, you can use group_by to loop over each point and mutate to form a list of distances, set distance as the min of the list, and set center as the name of the minimum distance center. I've included two alternatives for the cases of duplicate rows or point names.

    library(dplyr)
   centers <- data.frame(x_ce = c(300,180,450,500),
                        y_ce = c(23,15,10,20),
                        center = c('a','b','c','d'))
   points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                       x_p = c(160,600,400,245, 245),
                       y_p = c(7,23,56,12, 12))
#
#  If duplicate rows need to be removed
#
  result1 <- points %>% group_by(point) %>%  distinct() %>% 
                                  mutate(lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ), 
                                  distance=min(unlist(lst)), 
                                  center = centers$center[which.min(unlist(lst))]) %>%
             select(-lst)

which gives the result

# A tibble: 4 x 5
# Groups:   point [4]
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a 

and

#
# Alternative if point names are not unique
#
  points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                       x_p = c(160,600,400,245, 550),
                       y_p = c(7,23,56,12, 25))
  result2 <- points %>% rowwise() %>%
                    mutate( lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ), 
                               distance=min(unlist(lst)), 
                              center = centers$center[which.min(unlist(lst))]) %>%
                    ungroup() %>% select(-lst)

with the result

# A tibble: 5 x 5
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a     
5 p4      550    25     50.2 d    

Upvotes: 2

AntoniosK
AntoniosK

Reputation: 16121

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

library(tidyverse)

points %>%
  mutate(c = list(centers)) %>%
  unnest() %>%                       # create all posible combinations of points and centers as a dataframe
  rowwise() %>%                      # for each combination
  mutate(d = sqrt(sum((c(x_p,y_p)-c(x_ce,y_ce))^2))) %>%   # calculate distance
  ungroup() %>%                                            # forget the grouping
  group_by(point, x_p, y_p) %>%                            # for each point
  summarise(closest_center = center[d == min(d)]) %>%      # keep the closest center
  ungroup()                                                # forget the grouping

# # A tibble: 4 x 4
#   point   x_p   y_p closest_center
#   <fct> <dbl> <dbl> <fct>         
# 1 p1      160     7 b             
# 2 p2      600    23 d             
# 3 p3      400    56 c             
# 4 p4      245    12 a

Upvotes: 2

Related Questions