rmuc8

Reputation: 2989

Replace String B with String C if it contains (but not exactly matches) String A

I have a data frame match_df that holds "matching rules": each value in the column old should be replaced with the corresponding value in the column new in the data frames the rules are applied to.

old <- c("10000","20000","300ZZ","40000")
new <- c("Name1","Name2","Name3","Name4")
match_df <- data.frame(old,new)

  old   new
1 10000 Name1
2 20000 Name2
3 300ZZ Name3  # watch the letters
4 40000 Name4

I want to apply the matching rules above on a data frame working_df

id <- c(1,2,3,4)
value <- c("xyz-10000","20000","300ZZ-230002112","40")
working_df <- data.frame(id,value)

   id   value
1  1    xyz-10000
2  2    20000
3  3    300ZZ-230002112
4  4    40

My desired result is

# result

   id   value
1  1    Name1
2  2    Name2
3  3    Name3
4  4    40 

This means that I am not looking for an exact match. Instead, I'd like to replace the whole string in working_df$value whenever it contains one of the strings in match_df$old. A value that contains only part of an old string (such as "40" for "40000") should stay unchanged.

I like the solution posted in R: replace characters using gsub, how to create a function?, but it works only for exact matches. I experimented with gsub, str_replace_all from stringr but I couldn't find a solution that works for me. There are many solutions for exact matches on SOF, but I couldn't find a comprehensible one for this problem.

Any help is highly appreciated.

Upvotes: 0

Views: 152

Answers (3)

Tyler Rinker

Reputation: 109924

Here are 2 approaches using Map + <<- and a for loop:

working_df[["value2"]] <- as.character(working_df[["value"]])
Map(function(x, y){working_df[["value2"]][grepl(x, working_df[["value2"]])] <<- y}, old, new)

working_df

##   id           value value2
## 1  1       xyz-10000  Name1
## 2  2           20000  Name2
## 3  3 300ZZ-230002112  Name3
## 4  4              40     40

## or...

working_df[["value2"]] <- as.character(working_df[["value"]])
for (i in seq_along(old)) {  # iterate over the rules, not over the rows of working_df
    working_df[["value2"]][grepl(old[i], working_df[["value2"]])] <- new[i]
}

Upvotes: 0

NicE

Reputation: 21425

I'm not sure this is the most elegant/efficient way of doing it but you could try something like this:

working_df$value <- sapply(working_df$value,function(y){ 
  idx<-which(sapply(match_df$old,function(x){grepl(x,y)}))[1]
  if(is.na(idx)) idx<-0
  ifelse(idx>0,as.character(match_df$new[idx]),as.character(y))
})

For each value of working_df, it uses grepl to check whether any entry of match_df$old partially matches it and, if so, gets the index of that row. If more than one row matches, it takes the first one. If none matches, the original value is kept.
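The same "first matching rule wins" logic can be wrapped in a small reusable function. This is only a sketch: replace_partial is a hypothetical helper name (not part of base R), and fixed = TRUE is my assumption that the old strings should be treated as literal substrings rather than regular expressions.

```r
# replace_partial() is a hypothetical helper name, not a base R function.
replace_partial <- function(values, old, new) {
  values <- as.character(values)  # guard against factor columns
  vapply(values, function(v) {
    # Indices of all rules whose `old` string occurs literally inside v
    idx <- which(vapply(old, function(o) grepl(o, v, fixed = TRUE), logical(1)))
    # First matching rule wins; keep v unchanged when nothing matches
    if (length(idx) > 0) new[idx[1]] else v
  }, character(1), USE.NAMES = FALSE)
}

res <- replace_partial(c("xyz-10000", "20000", "300ZZ-230002112", "40"),
                       old = c("10000", "20000", "300ZZ", "40000"),
                       new = c("Name1", "Name2", "Name3", "Name4"))
res
# "Name1" "Name2" "Name3" "40"
```

With the question's data this would be called as working_df$value <- replace_partial(working_df$value, match_df$old, match_df$new); if match_df was built with factor columns, wrap both in as.character() first.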

Upvotes: 1

Chris

Reputation: 7288

You need the grep function. It returns the indices of the elements of a vector that match a pattern (any pattern, not necessarily a full string match). For instance, this will tell you which rows of working_df$value contain the first "old" value, "10000":

grep(match_df[1,1], working_df$value)

Once you have that information, you can look up the corresponding "new" value for that pattern, and replace it on the matching rows.
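A minimal sketch of that lookup-and-replace step, looping over the rules. It rebuilds the question's data with stringsAsFactors = FALSE so the columns are plain character vectors, and it assumes literal substring matching (fixed = TRUE); neither choice comes from the answer itself.

```r
# Data from the question; stringsAsFactors = FALSE keeps columns as character
match_df <- data.frame(old = c("10000", "20000", "300ZZ", "40000"),
                       new = c("Name1", "Name2", "Name3", "Name4"),
                       stringsAsFactors = FALSE)
working_df <- data.frame(id = c(1, 2, 3, 4),
                         value = c("xyz-10000", "20000", "300ZZ-230002112", "40"),
                         stringsAsFactors = FALSE)

# Search the original values so an already-replaced row is never matched again
original <- working_df$value
for (i in seq_len(nrow(match_df))) {
  # fixed = TRUE treats the pattern as a literal substring, not a regex
  hits <- grep(match_df$old[i], original, fixed = TRUE)
  working_df$value[hits] <- match_df$new[i]
}

working_df$value
# "Name1" "Name2" "Name3" "40"
```

Note that when a value contains several old strings, the last matching rule wins here, because later iterations overwrite earlier ones.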

Upvotes: 0
