find matching value in a column based on other vector

Question

I have a data frame and vector like this:

df1 <- data.frame(orig = c(1,1,1,2,2,2,2,3,3),
                  proxy = c(1,43,65,2,44,45,46,3,55),
                  dist = c(0, 100,101, 10, 1000, 5000, 5001,0,3))

v <- c(1,45:100)

I now want the following:

For each unique value in df1$orig (here it's a numeric for simplicity, but it could be character too), if the same orig value is not available in v, find the best proxy that has the lowest dist.

In this example, the first unique value in df1$orig is 1, which is also in v, so we take it. The second unique value in df1$orig is 2, which is not in v. Its best proxy by lowest dist is 44, but that is not in v either. The next best is 45, which is in v, so we take it. The third unique value in df1$orig is 3, and there is no 3 in v. The best proxy here is 55.

The solution is c(1,45,55).

Note that the first value in proxy for each orig is the orig value itself. dist is sorted here, but that is not always the case.

Mikko Marttila · Accepted Answer

This can be done in a couple of steps with {dplyr}: keep the proxies that are in v, sort by dist and pick the first for each orig:

library(dplyr)

df1 %>% 
  filter(proxy %in% v) %>% 
  arrange(dist) %>% 
  group_by(orig) %>% 
  slice(1)
#> # A tibble: 3 x 3
#> # Groups:   orig [3]
#>    orig proxy  dist
#>   <dbl> <dbl> <dbl>
#> 1     1     1     0
#> 2     2    45  5000
#> 3     3    55     3

Created on 2019-09-11 by the reprex package (v0.3.0)
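The question asks for a plain vector rather than a data frame, so the same three steps (keep the proxies that are in v, sort by dist, take the first per orig) can also be sketched in base R. This is only one possible way to write it. The names `avail` and `best` are introduced here for illustration and do not appear in the original answer.

```r
df1 <- data.frame(orig = c(1,1,1,2,2,2,2,3,3),
                  proxy = c(1,43,65,2,44,45,46,3,55),
                  dist = c(0, 100,101, 10, 1000, 5000, 5001,0,3))
v <- c(1, 45:100)

# Keep only the rows whose proxy is available in v
# (`avail` is a hypothetical helper name)
avail <- df1[df1$proxy %in% v, ]

# Sort by orig, then by dist, so that the closest available proxy
# comes first within each orig
avail <- avail[order(avail$orig, avail$dist), ]

# Take the first remaining row per orig and extract its proxy
best <- avail$proxy[!duplicated(avail$orig)]
best
#> [1]  1 45 55
```

As with the {dplyr} pipeline, an orig value with no proxy in v at all is silently dropped from the result, so the output may be shorter than unique(df1$orig).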
