tomasz
tomasz

Reputation: 39

How can I match duplicated values in a column, examine the condition and output a final result in R?

I have a dictionary table like this:

ID Position Region
a 1-50 D1a
a 80-100 D2a
a 250-300 D3a
b 50-100 D1b
b 150-180 D2b
c 1-20 D1c
c 50-80 D2c
c 100-200 D3c
c 250-300 D4c

And a target table like this:

ID Position
a 28
a 85
a 320
b 55
b 100
c 18
c 45
c 180
c 270

The logic is to examine whether the numeric value-Position in target table is within the range of position in dictionary table and output the region values with a given ID.

I first thought that R package hash should work. But later I found that the hash keys must be unique, which in my case are not.

How can I match each ID first, and use if between() or other functions to map my targets to their respective region?

Upvotes: 0

Views: 93

Answers (3)

Ronak Shah
Ronak Shah

Reputation: 389275

You may use package fuzzyjoin to perform the join on range after splitting the Position column into two columns.

Using data from @Guillaume -

library(dplyr)
library(fuzzyjoin)
library(tidyr)

one %>%
  separate(Position, c('min', 'max'), sep = '-', convert = TRUE) %>%
  fuzzy_right_join(two, by = c('ID', 'min' = 'Position', 'max' = 'Position'), 
                  match_fun = c(`==`, `<=`, `>=`)) %>%
  select(ID = ID.y, Position, min, max, Region)

#  ID    Position   min   max Region
#  <chr>    <int> <int> <int> <chr> 
#1 a           28     1    50 D1a   
#2 a           85    80   100 D2a   
#3 a          320    NA    NA NA    
#4 b           55    50   100 D1b   
#5 b          100    50   100 D1b   
#6 c           18     1    20 D1c   
#7 c           45    NA    NA NA    
#8 c          180   100   200 D3c   
#9 c          270   250   300 D4c   

Based on your expected output you may use fuzzy_(inner_join/left_join/full_join).

Upvotes: 0

Peace Wang
Peace Wang

Reputation: 2419

Is this your expected result?

library(data.table)

dt1 <- fread("
ID  Position    Region
a   1-50    D1a
a   80-100  D2a
a   250-300 D3a
b   50-100  D1b
b   150-180 D2b
c   1-20    D1c
c   50-80   D2c
c   100-200 D3c
c   250-300 D4c
")

dt2 <- fread("
ID  Position
a   28
a   85
a   320
b   55
b   100
c   18
c   45
c   180
c   270
")

#dt1[,c("Position_left","Position_right") := tstrsplit(Position,"-")]
#dt1[, dt2$Position %between% list(Position_left,Position_right)]
# [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE

dt1[,.( ID,
        Position = dt2$Position,
        Region = fifelse(dt2$Position %between% tstrsplit(Position,"-"),Region,NA)
       )]

       ID Position Region
   <char>    <int> <char>
1:      a       28    D1a
2:      a       85    D2a
3:      a      320   <NA>
4:      b       55    D1b
5:      b      100   <NA>
6:      c       18    D1c
7:      c       45   <NA>
8:      c      180    D3c
9:      c      270    D4c

Upvotes: 0

Guillaume
Guillaume

Reputation: 681

Maybe something like this with tidy verse approach (tidyr::separate)

However it's a quick answer and I'm not sure to understand your data relations as you want.

library(tidyverse)

one <- tibble::tribble(
  ~ID, ~Position, ~Region,
  "a",    "1-50",   "D1a",
  "a",  "80-100",   "D2a",
  "a", "250-300",   "D3a",
  "b",  "50-100",   "D1b",
  "b", "150-180",   "D2b",
  "c",    "1-20",   "D1c",
  "c",   "50-80",   "D2c",
  "c", "100-200",   "D3c",
  "c", "250-300",   "D4c"
  )

two <- tibble::tribble(
         ~ID, ~Position,
         "a",       28L,
         "a",       85L,
         "a",      320L,
         "b",       55L,
         "b",      100L,
         "c",       18L,
         "c",       45L,
         "c",      180L,
         "c",      270L
         )


one_ <- one %>% 
  tidyr::separate(Position, c('p_min', 'p_max'), sep = "-") %>% 
  mutate_at(vars(starts_with('p_')), as.integer)

two %>% 
  mutate(rn = row_number()) %>% 
  left_join(one_) %>% 
  mutate(in_between = (Position >= p_min & Position <= p_max)) %>% 
  filter(in_between) %>% 
  distinct(rn, .keep_all = TRUE)
Joining, by = "ID"
# A tibble: 7 × 7
  ID    Position    rn p_min p_max Region in_between
  <chr>    <int> <int> <int> <int> <chr>  <lgl>     
1 a           28     1     1    50 D1a    TRUE      
2 a           85     2    80   100 D2a    TRUE      
3 b           55     4    50   100 D1b    TRUE      
4 b          100     5    50   100 D1b    TRUE      
5 c           18     6     1    20 D1c    TRUE      
6 c          180     8   100   200 D3c    TRUE      
7 c          270     9   250   300 D4c    TRUE  

Upvotes: 1

Related Questions