spore234

Reputation: 3640

Find matching value in a column based on another vector

I have a data frame and a vector like this:

df1 <- data.frame(orig = c(1,1,1,2,2,2,2,3,3),
                  proxy = c(1,43,65,2,44,45,46,3,55),
                  dist = c(0, 100,101, 10, 1000, 5000, 5001,0,3))

v <- c(1,45:100)

I now want the following:

For each unique value in df1$orig (here it's a numeric for simplicity, but it could be character too), if the same orig value is not available in v, find the best proxy that has the lowest dist.

In this example the first value in df1$orig is 1, and this value is available in v as well, so we take it. The second unique value in df1$orig is 2, which is not available in v. The best proxy with the lowest dist is 44 in this case, but it is not in v either. The next best is 45, and this value is in v, so we take it. The third unique value in df1$orig is 3, and there is no 3 in v. The best proxy here is 55.

The expected result is c(1,45,55).

Note that the first proxy value for each orig is the orig value itself. dist happens to be sorted within each orig here, but that is not always the case.

Upvotes: 2

Views: 68

Answers (2)

GKi

Reputation: 39657

In case you are also interested in a base R solution alongside the dplyr one:

First reduce df1 to the rows whose proxy has a match in v, then order by orig and dist, and finally keep the first (non-duplicated) row for each orig.

tt <- df1[df1$proxy %in% v,]
tt <- tt[order(tt$orig, tt$dist),]
tt[!duplicated(tt$orig),]
#  orig proxy dist
#1    1     1    0
#6    2    45 5000
#9    3    55    3
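
The steps above pick the lowest dist among the proxies found in v. The question also says the orig value itself should be taken whenever it is in v. With the example data both rules give the same rows, but if the self-match might not have the lowest dist, a possible variation (a sketch only, not part of the original approach) is to add `tt$proxy != tt$orig` as an extra sort key so that self-matches come first within each orig:

```r
df1 <- data.frame(orig = c(1,1,1,2,2,2,2,3,3),
                  proxy = c(1,43,65,2,44,45,46,3,55),
                  dist = c(0, 100,101, 10, 1000, 5000, 5001,0,3))
v <- c(1,45:100)

# Keep only rows whose proxy is available in v
tt <- df1[df1$proxy %in% v,]
# Within each orig, rows with proxy == orig sort first (FALSE < TRUE), then by dist
tt <- tt[order(tt$orig, tt$proxy != tt$orig, tt$dist),]
# First remaining row per orig
res <- tt[!duplicated(tt$orig),]
res$proxy
```

For the data in the question this selects the same rows as the code above.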

If some orig values would be lost because none of their proxies match v, you can use the following, which keeps one entry per unique orig and returns NA where no match exists:

tt <- df1[df1$proxy %in% v,]
tt <- tt[order(tt$orig, tt$dist),]
tt <- tt[!duplicated(tt$orig),c("orig", "proxy")]
tt$proxy[match(unique(df1$orig), tt$orig)]
#[1]  1 45 55
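
To illustrate the no-match case, here is a small sketch that wraps the same steps in a function and applies it to an extended data frame. The function name `best_proxy` and the extra rows for orig 4 in `df2` are made up for this illustration; they are not part of the question's data.

```r
# Hypothetical helper: one proxy per unique orig, NA if no proxy is in v
best_proxy <- function(df, v) {
  tt <- df[df$proxy %in% v,]
  tt <- tt[order(tt$orig, tt$dist),]
  tt <- tt[!duplicated(tt$orig), c("orig", "proxy")]
  # match() returns NA for orig values that dropped out of tt
  tt$proxy[match(unique(df$orig), tt$orig)]
}

df1 <- data.frame(orig = c(1,1,1,2,2,2,2,3,3),
                  proxy = c(1,43,65,2,44,45,46,3,55),
                  dist = c(0, 100,101, 10, 1000, 5000, 5001,0,3))
v <- c(1,45:100)

# Assumed extra rows: neither proxy of orig 4 is in v
df2 <- rbind(df1, data.frame(orig = c(4,4), proxy = c(4,20), dist = c(0,7)))

res1 <- best_proxy(df1, v)
res2 <- best_proxy(df2, v)
res2
#[1]  1 45 55 NA
```

The result has the same length and order as unique(df$orig), so it can be lined up with the original orig values directly.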

Upvotes: 1

Mikko Marttila

Reputation: 11878

This can be done in a couple of steps with {dplyr}: keep the proxies that are in v, sort by dist and pick the first for each orig:

library(dplyr)

df1 %>% 
  filter(proxy %in% v) %>% 
  arrange(dist) %>% 
  group_by(orig) %>% 
  slice(1)
#> # A tibble: 3 x 3
#> # Groups:   orig [3]
#>    orig proxy  dist
#>   <dbl> <dbl> <dbl>
#> 1     1     1     0
#> 2     2    45  5000
#> 3     3    55     3

Created on 2019-09-11 by the reprex package (v0.3.0)
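
As a variation, and assuming dplyr 1.0.0 or later (newer than the version this answer was written with), the sort-and-pick step can be written with slice_min(), and pull() extracts the plain vector asked for in the question. This is a sketch of the same idea, not a different method:

```r
library(dplyr)

df1 <- data.frame(orig = c(1,1,1,2,2,2,2,3,3),
                  proxy = c(1,43,65,2,44,45,46,3,55),
                  dist = c(0, 100,101, 10, 1000, 5000, 5001,0,3))
v <- c(1,45:100)

res <- df1 %>% 
  filter(proxy %in% v) %>%        # keep proxies that are in v
  group_by(orig) %>% 
  # lowest dist per orig; with_ties = FALSE keeps exactly one row per group
  slice_min(dist, n = 1, with_ties = FALSE) %>% 
  ungroup() %>% 
  pull(proxy)
res
#> [1]  1 45 55
```

Like the pipeline above, this drops any orig that has no proxy in v.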

Upvotes: 3
