Reputation: 2443
I have a data frame with N rows, and I would like to find, for a subset of the rows, the closest row to each of them among the rows that belong to the same group.
So for example:
> df
# A tibble: 8,014 x 4
A B C Group
<dbl> <dbl> <dbl> <int>
1 -0.396 -0.621 -0.759 1
2 -0.451 -0.625 -0.924 1
3 -0.589 -0.624 -1.26 1
4 -0.506 -0.625 -1.09 1
5 NA 1.59 -0.593 1
6 -0.286 4.22 -0.0952 1
7 NA 2.91 -0.0952 1
8 NA 4.22 -0.924 1
9 -0.175 1.52 -0.0952 1
10 NA 1.74 1.56 1
# ... with 8,004 more rows
In this case, I would like to find the closest rows to row 2 and row 3 among the rows with Group == 1. I also have to do this efficiently, so a for loop is not really an option.
I would like to use the dist function because it has the nice feature of handling NAs properly, but I don't need to calculate the entire distance matrix; that would be a waste.
I tried this, but it failed, and it is also wasteful:
res = Map(function(x, y) dist(as.matrix(rbind(x, y))),
          df[2:3, ] %>% group_by(Group),
          df %>% group_by(Group))
Upvotes: 0
Views: 115
Reputation: 2431
Here is one way to do this, although it does create the entire distance matrix for each group. I'm not sure why that's wasteful, considering what you're trying to do:
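library(tidyverse)

# For each row of x, return the index (within x) of its nearest other row.
min_dist <- function(x) {
  d <- as.matrix(dist(x))  # dist() skips NA coordinates pairwise and rescales the sum
  diag(d) <- NA            # the zero diagonal would make every row its own nearest row
  # which.min() ignores NA and returns the first index on ties
  nearest <- apply(d, 1, function(r) if (all(is.na(r))) NA_integer_ else which.min(r))
  unname(nearest)
}

df %>%
  group_by(Group) %>%
  mutate(group_row = row_number()) %>%  # row index within each group
  ungroup() %>%
  nest(data = -Group) %>%
  # use only the coordinate columns, so group_row does not enter the distance
  mutate(nearest_row = map(data, ~ min_dist(select(.x, A, B, C)))) %>%
  unnest(c(data, nearest_row))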
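If the full per-group matrix does turn out to be too expensive, the distances can be computed only from the rows of interest to the other rows in their group. The following is a minimal sketch, not a tested drop-in: nearest_in_group is a hypothetical helper name, the column names A, B, C and Group are taken from the question, and the NA handling is written by hand on the assumption that it mirrors what dist documents for Euclidean distance (columns with NA are skipped and the sum is scaled up by the total number of columns over the number used).

```r
# Hypothetical helper: nearest same-group row for selected query rows,
# without building the full distance matrix.
nearest_in_group <- function(df, query, cols = c("A", "B", "C"), group = "Group") {
  X <- as.matrix(df[, cols, drop = FALSE])
  g <- df[[group]]
  vapply(query, function(i) {
    cand <- setdiff(which(g == g[i]), i)   # other rows in the same group
    if (length(cand) == 0) return(NA_integer_)
    # squared differences between each candidate row and row i
    sq <- sweep(X[cand, , drop = FALSE], 2, X[i, ])^2
    used <- rowSums(!is.na(sq))            # columns where neither value is NA
    # assumed to mirror dist(): rescale the sum by ncol / columns used
    d <- sqrt(rowSums(sq, na.rm = TRUE) * ncol(X) / used)
    d[used == 0] <- NA                     # no comparable columns at all
    if (all(is.na(d))) NA_integer_ else cand[which.min(d)]
  }, integer(1))
}

# Made-up toy data, only to show the call
toy <- data.frame(
  A = c(0, 1, NA, 5, 0, 10),
  B = c(0, 0, 0, 5, 1, 10),
  C = c(0, 0.5, 1, 5, 0, 10),
  Group = c(1, 1, 1, 1, 2, 2)
)
nearest_in_group(toy, query = 2:3)
```

The result holds row numbers in the original data frame rather than within-group indices. It still loops over the query rows, but each step is vectorised over the candidate rows, so the cost grows with (number of query rows) x (group size) instead of the square of the group size.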
Upvotes: 1