flora micy

Reputation: 23

R regex to get a partial match

I tried to use stri_replace_all_regex to replace strings, but it did not give the result I want. Are there other ways to do this? Thanks to anyone who can help!

First attempt:

> library(stringi)
> a <- c('abc2','xycd2','mnb345','tumb b~','lymavc') 
> b <- c('ab','abc','xyc','mnb','tum','mn','tumb','lym','lymav') 
> stri_replace_all_regex(a, "\\b" %s+% b %s+% "\\S+", b, vectorize_all=FALSE)

However, the result is:

> c("ab","xyc","mn" ,"tum b~","lym")

which is not what I want. The result I want is:

> c('abc','xyc','mnb','tumb','lymav')

Second attempt:

> pattern <- paste0("\\b(", b, ")\\S+", collapse = "|")
> gsub(pattern, "\\w", a)

However, this failed too. Sorry, I did not express myself clearly. What I want is to replace each element of a with the matching element of b. As you can see, the elements of a and b share the same leading characters, and I want to strip the extra trailing part from a. The match should be greedy, though: for 'tumb b~' the result should be 'tumb', not 'tum', and for 'mnb345' it should be 'mnb', not 'mn'. I have only just started learning regular expressions, so my attempts may be clumsy. Looking forward to your reply!

A new question has come up.

a <- c('tums310','tums310~20','tums320')  
b <- c('tums1','tums2','tums3')

I want the result to be:

"tums3" "tums3" "tums3"

Upvotes: 2

Views: 85

Answers (2)

GKi

Reputation: 39717

Maybe you are looking for adist.

a <- c('abc2','xycd2','mnb345','tumb b~','lymavc') 
b <- c('ab','abc','xyc','mnb','tum','mn','tumb','lym','lymav')
b[apply(adist(b, a) + adist(b, a, partial=TRUE), 2, which.min)]
#[1] "abc"   "xyc"   "mnb"   "tumb"  "lymav"

a <- c('tums310','tums310~20','tums320')  
b <- c('tums1','tums2','tums3')
b[apply(adist(b, a) + adist(b, a, partial=TRUE), 2, which.min)]
#[1] "tums3" "tums3" "tums3"
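
To see why summing the two distances picks the longest matching element, it may help to look at the two matrices separately. The following is only an illustrative breakdown of the one-liner above, using the first example; the variable names full, part, score and best are made up for this sketch.

```r
a <- c('abc2','xycd2','mnb345','tumb b~','lymavc')
b <- c('ab','abc','xyc','mnb','tum','mn','tumb','lym','lymav')

# Edit distance between each whole b (rows) and each whole a (columns):
# shorter candidates are penalised for every character of a they miss.
full <- adist(b, a)

# With partial = TRUE each b is compared against its best-matching
# substring of a, so every b that occurs inside a scores 0 here.
part <- adist(b, a, partial = TRUE)

# Label rows and columns so the matrices are easier to read when printed.
dimnames(full) <- dimnames(part) <- list(b, a)

# The sum favours candidates that are contained in a (part == 0) and,
# among those, the one covering most of a (smallest full distance).
score <- full + part
best <- b[apply(score, 2, which.min)]
best
```

Printing full, part and score side by side shows, for example, that both 'ab' and 'abc' have a partial distance of 0 to 'abc2', and the full distance is what breaks the tie in favour of 'abc'.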

Upvotes: 2

Chris Ruehlemann

Reputation: 21442

Here's a fuzzyjoin solution using the function stringdist_join:

library(fuzzyjoin)
library(dplyr)  # provides %>%, group_by() and slice_min()
stringdist_join(
  # join `b` as a dataframe ... 
  data.frame(b),
  # ... with `a` as a dataframe:
  data.frame(a),
  # join by ...:
  by = c("b" = "a"),
  # use left join:
  mode = 'left',
  # use Jaro-Winkler distance metric:
  method = "jw",
  # enable case-insensitive matching:
  ignore_case = TRUE,
  # name for distance column:
  distance_col = 'dist') %>% 
# retain only closest matches:
group_by(a) %>%
  slice_min(order_by = dist, n = 1)
# A tibble: 5 × 3
# Groups:   a [5]
  b     a         dist
  <chr> <chr>    <dbl>
1 abc   abc2    0.0833
2 lymav lymavc  0.0556
3 mnb   mnb345  0.167 
4 tumb  tumb b~ 0.143 
5 xyc   xycd2   0.133

Column b now contains the closest match for each value of a.
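
If fuzzy matching is more than is needed, the "longest match from the left" requirement in the question can also be sketched in base R with startsWith. This is a different technique from the string-distance join above, and longest_prefix is a hypothetical helper written for this sketch, not a function from any package. It assumes the wanted element of b is always an exact prefix of the element of a.

```r
# Hypothetical helper: for each element of `a`, return the longest element
# of `b` that `a` starts with, or NA if no element of `b` is a prefix.
longest_prefix <- function(a, b) {
  vapply(a, function(x) {
    hits <- b[startsWith(x, b)]           # all candidates that are prefixes of x
    if (length(hits) == 0) return(NA_character_)
    hits[which.max(nchar(hits))]          # keep the longest one (greedy)
  }, character(1), USE.NAMES = FALSE)
}

a1 <- c('abc2','xycd2','mnb345','tumb b~','lymavc')
b1 <- c('ab','abc','xyc','mnb','tum','mn','tumb','lym','lymav')
res1 <- longest_prefix(a1, b1)

a2 <- c('tums310','tums310~20','tums320')
b2 <- c('tums1','tums2','tums3')
res2 <- longest_prefix(a2, b2)
```

Unlike the distance-based approaches, this returns NA rather than a near match when nothing in b is a prefix, which may or may not be what is wanted.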

Upvotes: 0
