Omry Atia

Reputation: 2443

In R's base/dplyr, find the row closest to each row of a data set

I have a data frame with N rows. For a subset of those rows, I would like to find the closest row to each of them among the rows in the data set that belong to the same group.

So for example:

> df
# A tibble: 8,014 x 4
     A      B       C      Group
    <dbl>  <dbl>   <dbl>    <int>
 1  -0.396 -0.621 -0.759      1
 2  -0.451 -0.625 -0.924      1
 3  -0.589 -0.624 -1.26       1
 4  -0.506 -0.625 -1.09       1
 5  NA      1.59  -0.593      1
 6  -0.286  4.22  -0.0952     1
 7  NA      2.91  -0.0952     1
 8  NA      4.22  -0.924      1
 9  -0.175  1.52  -0.0952     1
10  NA      1.74   1.56       1
# ... with 8,004 more rows

For instance, I would like to find the rows closest to row 2 and to row 3 among the rows with Group == 1. This also has to be efficient, so a for loop is not really an option.

I would like to use the dist function because it handles NAs properly, but I don't need the entire distance matrix; computing all of it would be a waste.

I tried this, but it failed and is also wasteful:

res <- Map(function(x, y) dist(as.matrix(rbind(x, y))),
           df[2:3, ] %>% group_by(Group),
           df %>% group_by(Group))

Upvotes: 0

Views: 115

Answers (1)

David Klotz

Reputation: 2431

Here is one way to do this, although it does create the entire distance matrix for each group. I'm not sure why that's wasteful, considering what you're trying to do:

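library(tidyverse)  # attaches dplyr, tidyr and purrr

min_dist <- function(x) {
  # Full pairwise distance matrix for one group. dist() skips NA coordinates
  # and rescales the sum, so rows with missing values still get a distance.
  d <- x %>%
    select(-group_row) %>%  # the row index is not a coordinate
    dist() %>%
    as.matrix()
  diag(d) <- NA  # as.matrix() puts zeros on the diagonal; drop the self-matches
  # Index of the nearest other row: first one on ties, NA if no distance is defined
  apply(d, 1, function(r) if (all(is.na(r))) NA_integer_ else which.min(r)) %>%
    unname()
}

df %>%
  group_by(Group) %>%
  mutate(group_row = row_number()) %>%
  ungroup() %>%
  nest(data = -Group) %>%
  mutate(nearest_row = map(data, min_dist)) %>%
  unnest(c(data, nearest_row))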
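
If only a few rows need a match (rows 2 and 3 in the question), the full matrix can be skipped by computing distances from each target row to the other rows of its group only. The following is a minimal base R sketch, not a tested drop-in solution: nearest_in_group and the toy data frame are hypothetical names made up for illustration, and the rescaling step assumes the Euclidean behaviour described in the dist documentation (the sum over usable columns is scaled up by the total number of columns divided by the number used).

```r
# Hypothetical helper (not part of base R or dplyr): for each target row,
# return the index of the nearest other row in the same group.
nearest_in_group <- function(df, targets, cols, group = "Group") {
  m <- as.matrix(df[, cols, drop = FALSE])
  p <- ncol(m)
  vapply(targets, function(i) {
    # Candidate rows: same group as row i, excluding row i itself
    idx <- setdiff(which(df[[group]] == df[[group]][i]), i)
    if (length(idx) == 0) return(NA_integer_)
    # Squared differences between row i and every candidate row
    sq <- sweep(m[idx, , drop = FALSE], 2, m[i, ])^2
    used <- rowSums(!is.na(sq))
    # Assumed to mirror dist(): sum over usable columns, scaled by p / used
    d <- sqrt(rowSums(sq, na.rm = TRUE) * p / used)
    d[used == 0] <- NA  # no shared non-NA column, so no distance
    if (all(is.na(d))) NA_integer_ else idx[which.min(d)]
  }, integer(1))
}

# Toy data, invented for this example
toy <- data.frame(
  A = c(0, 1, NA, 5, 5, 6),
  B = c(0, 0, 3, 5, 5.5, 9),
  Group = c(1, 1, 1, 2, 2, 2)
)

nearest_in_group(toy, targets = c(1, 4), cols = c("A", "B"))
# rows 2 and 5
```

For k target rows in a group of n rows this does on the order of k * n distance computations instead of n^2, at the cost of reimplementing the NA handling by hand.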

Upvotes: 1
