887
887

Reputation: 619

How to identify closest observation within the same column?

I am working with a data frame resembling the following:

enter image description here I am looking to create columns that identify the name and x-location value of the nearest x-location for every observation within each Day. So the above table would become:

enter image description here

Where closest_x is the nearest x_location value to that row's x-location within the same day. closest_x_name is the name of the closest person. second_closest_x is the value of the second nearest x_location within the column and so on. Note that Harry still appears twice in the second day, but only appears once amongst the closest, second_closest, etc. columns...we keep all his observations, but only rank him once.

I figured one way to approach this would be to use which.min.

library(tidyverse)

dat1 <- dat %>%
  group_by(Day) %>%
  mutate(x_closest = which.min(abs(x_location -x_location)))

But such approach doesn't work for a number of reasons.

structure(list(Day = c("Monday", "Monday", "Monday", "Monday", 
"Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday"), Name = c("Steve", "Joe", "Brian", "Mike", "Harry", 
"Steve", "Joe", "Brian", "Mike", "Harry", "Harry"), x_location = c(2.5, 
6, 4.5, 8, 5, 1, 3.5, 4, 9.5, 10, 10), closest_x = c(4.5, 5, 
5, 6, 4.5, 3.5, 4, 3.5, 10, 9.5, 9.5), closest_x_name = c("Brian", 
"Harry", "Harry", "Joe", "Brian", "Joe", "Brian", "Joe", "Harry", 
"Mike", "Mike"), second_closest_x = c(5, 4.5, 6, 5, 6, 4, 1, 
1, 4, 4, 4), second_closest_name = c("Harry", "Brian", "Joe", 
"Harry", "Joe", "Brian", "Steve", "Steve", "Brian", "Brian", 
"Brian"), third_closest_x = c(6, 8, 2.5, 4.5, 2.5, 9.5, 9.5, 
9.5, 3.5, 3.5, 3.5), third_closest_name = c("Joe", "Mike", "Steve", 
"Brian", "Steve", "Mike", "Mike", "Mike", "Joe", "Joe", "Joe"
)), class = "data.frame", row.names = c(NA, -11L))

Upvotes: 2

Views: 92

Answers (1)

zephryl
zephryl

Reputation: 17309

Here’s a tidyverse-based solution:

library(dplyr)
library(purrr)
library(tidyr)

closest_rows <- function(x_location_cur, Name_cur) {
  closest <- cur_data()$Name[order(abs(cur_data()$x_location - x_location_cur))]
  tibble(
    dist = c("closest", "second_closest", "third_closest"),
    name = unique(closest[closest != Name_cur])[1:3],
    x = cur_data()$x_location[match(name, cur_data()$Name)]
  )
}

dat %>%
  group_by(Day) %>%
  mutate(
    row = row_number(),
    dists = map2(x_location, Name, closest_rows)
  ) %>% 
  unnest(dists) %>% 
  pivot_wider(
    names_from = dist, 
    values_from = name:x,
    names_glue = "{dist}_{.value}",
    names_vary = "slowest"
  ) %>% 
  ungroup() %>% 
  select(!row)
# A tibble: 11 × 9
   Day     Name  x_location closest_name closest_x second_…¹ secon…² third…³ third…⁴
   <chr>   <chr>      <dbl> <chr>            <dbl> <chr>       <dbl> <chr>     <dbl>
 1 Monday  Steve        2.5 Brian              4.5 Harry         5   Joe         6  
 2 Monday  Joe          6   Harry              5   Brian         4.5 Mike        8  
 3 Monday  Brian        4.5 Harry              5   Joe           6   Steve       2.5
 4 Monday  Mike         8   Joe                6   Harry         5   Brian       4.5
 5 Monday  Harry        5   Brian              4.5 Joe           6   Steve       2.5
 6 Tuesday Steve        1   Joe                3.5 Brian         4   Mike        9.5
 7 Tuesday Joe          3.5 Brian              4   Steve         1   Mike        9.5
 8 Tuesday Brian        4   Joe                3.5 Steve         1   Mike        9.5
 9 Tuesday Mike         9.5 Harry             10   Brian         4   Joe         3.5
10 Tuesday Harry       10   Mike               9.5 Brian         4   Joe         3.5
11 Tuesday Harry       10   Mike               9.5 Brian         4   Joe         3.5
# … with abbreviated variable names ¹​second_closest_name, ²​second_closest_x,
#   ³​third_closest_name, ⁴​third_closest_x

Walkthrough:

For each row, the closest_rows() helper function gets indices of other rows in the group ordered by distance; removes rows with duplicate names or with the same name as the current row; then returns a tibble with names and x locations for the three closest matches. This yields a list column, which you can unnest, then pivot wider to get your desired format.

Upvotes: 1

Related Questions