Pavan kumar

Reputation: 15

Replace rules (string pattern matching) in R

I know similar questions may have been asked on this forum, but I feel my requirement is peculiar. I have a data frame with a column containing the following values. Below is just a sample; the full column has more than 1000 observations.

Reported Terms

"2 Left Axillary Lymph Nodes Resection"             
"cardyoohyper"                                      
"Ablation Breast"                                    
"Hypercarido"                                       
"chordiohyper"                                       
"Adenocarcinoma Of Colon (Radical Resection And Cr)"
"myocasta"
"hypermyopa"

I have another data frame with the below rules:

Data frame (originally posted as an image, which is not reproduced here)

I am expecting the below output:

"2 Left Axillary Lymph Nodes Resection"
"cardioohiper"
"Ablation Breast"
"hipercardio"
"cardiohiper"
"Adenocarcinoma Of Colon (Radical Resection And Cr)"
"miocasta"
"hipermiopa"

I am trying to hard-code the rules with the gsub function, but I understand that this will take a lot of time.

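pattern <- c("kardio", "carido", "cardyo", "cordio", "chordio")
replacement <- "cardio"
# gsub() takes a single pattern, so collapse the alternatives into one regex
gsub(paste(pattern, collapse = "|"), replacement, df$reportedterms)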

With the above approach I need to hard-code every rule separately, creating new pattern and replacement variables for each gsub call.

Is there a simple approach to solve this problem?

Upvotes: 0

Views: 166

Answers (1)

JBGruber

Reputation: 12410

First let's set this up as described by you:

library(tibble)

df <- tibble(text = c("2 Left Axillary Lymph Nodes Resection",
                      "cardyoohyper",
                      "Ablation Breast",
                      "Hypercarido",
                      "chordiohyper",
                      "Adenocarcinoma Of Colon (Radical Resection And Cr)",
                      "myocasta",
                      "hypermyopa"))

replace_dict <- tibble(pattern = list(c("kardio", "carido", "cardyo", "cordio", "chordio"), 
                                      "myoca",
                                      "myopa",
                                      "hyper"),
                       replacement = c("cardio", 
                                       "mioca",
                                       "miopa",
                                       "hiper"))

I would simply use stringi for this task, as it has a very efficient counterpart to gsub, stri_replace_all_fixed (you could also use the regex version, stri_replace_all_regex, which is a bit slower but works the same way). It can handle several patterns and replacements in a single call, so all we need to do is unnest the pattern column first and then run stringi:

batch_replace <- function(text, replace_dict) {

  replace_dict <- tidyr::unnest(replace_dict, pattern) 

  stringi::stri_replace_all_fixed(str = text, 
                                  pattern = replace_dict$pattern, 
                                  replacement = replace_dict$replacement, 
                                  vectorize_all = FALSE)
}
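As a side note, here is a minimal sketch of what vectorize_all = FALSE changes, since the function above depends on it. The vectors x, pattern, and replacement are made up for illustration and are not part of the original data:

```r
library(stringi)

# Illustrative inputs (assumed for this sketch, not taken from the question's file)
x <- c("myocasta", "hypermyopa")
pattern <- c("hyper", "myoca")
replacement <- c("hiper", "mioca")

# Default (vectorize_all = TRUE): x[i] is paired only with pattern[i]
paired <- stri_replace_all_fixed(x, pattern, replacement)

# vectorize_all = FALSE: every pattern is applied, in order, to every string
sequential <- stri_replace_all_fixed(x, pattern, replacement,
                                     vectorize_all = FALSE)

print(paired)      # "myocasta" "hypermyopa"
print(sequential)  # "miocasta" "hipermyopa"
```

With the default pairing, neither string contains the pattern it is paired with, so nothing changes. With vectorize_all = FALSE the whole dictionary is applied to each string, which is the behaviour a replacement dictionary needs.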

Let's put this function to a test:

df$text_new <- batch_replace(df$text, replace_dict)
df
#> # A tibble: 8 x 2
#>   text                                 text_new                            
#>   <chr>                                <chr>                               
#> 1 2 Left Axillary Lymph Nodes Resecti~ 2 Left Axillary Lymph Nodes Resecti~
#> 2 cardyoohyper                         cardioohiper                        
#> 3 Ablation Breast                      Ablation Breast                     
#> 4 Hypercarido                          Hypercardio                         
#> 5 chordiohyper                         cardiohiper                         
#> 6 Adenocarcinoma Of Colon (Radical Re~ Adenocarcinoma Of Colon (Radical Re~
#> 7 myocasta                             miocasta                            
#> 8 hypermyopa                           hipermiopa

I think that is what you wanted. Note that the function isn't very flexible, as you have to provide replace_dict in exactly the form shown (a list column of patterns plus a character column of replacements). Since you haven't shared the file, I can't help you with wrangling it into that form, so you will have to figure that out or ask a new question.
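For what it's worth, here is a rough sketch of one way to get into that form, assuming (and this is only an assumption, since the real file was not shared) that the rules come as a two-column table with comma-separated patterns in one cell. The name rules_raw and its layout are hypothetical:

```r
library(tibble)

# Hypothetical raw rules table: one row per replacement,
# alternative patterns separated by commas in a single string
rules_raw <- tibble(
  pattern = c("kardio, carido, cardyo, cordio, chordio",
              "myoca", "myopa", "hyper"),
  replacement = c("cardio", "mioca", "miopa", "hiper")
)

# Split each pattern string into a character vector, giving a list column
replace_dict <- tibble(
  pattern = strsplit(rules_raw$pattern, ",\\s*"),
  replacement = rules_raw$replacement
)

print(replace_dict)
```

The resulting replace_dict has the same structure as the one built by hand above, so it can be passed to batch_replace unchanged. If the real file is laid out differently, the splitting step would need to be adapted.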

update

If you want the replacement to be case-insensitive and also want to lowercase the text, the function could look like this:

batch_replace <- function(text, replace_dict, to_lower = TRUE, case_insensitive = TRUE) {

  replace_dict <- tidyr::unnest(replace_dict, pattern) 

  if (to_lower) {
    text <- tolower(text)
  }

  stringi::stri_replace_all_fixed(str = text, 
                                  pattern = replace_dict$pattern, 
                                  replacement = replace_dict$replacement, 
                                  vectorize_all = FALSE,
                                  opts_fixed = stringi::stri_opts_fixed(case_insensitive = case_insensitive))
}

You can turn on/off lower casing and case-insensitive replacement as you need it.
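To see the two switches separately, here is a small sketch that calls stringi directly on made-up strings (x, pattern, and replacement are illustrative only). It shows that case-insensitive matching replaces "Hyper" as well, while text outside the matches keeps its case unless tolower is applied:

```r
library(stringi)

# Illustrative inputs (assumed for this sketch)
x <- c("Hypercarido Test", "chordiohyper")
pattern <- c("carido", "chordio", "hyper")
replacement <- c("cardio", "cardio", "hiper")

# Case-insensitive fixed matching, all patterns applied to every string
ci <- stri_replace_all_fixed(
  x, pattern, replacement,
  vectorize_all = FALSE,
  opts_fixed = stri_opts_fixed(case_insensitive = TRUE)
)

print(ci)           # "hipercardio Test" "cardiohiper"
print(tolower(ci))  # "hipercardio test" "cardiohiper"
```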

Upvotes: 1
