Reputation: 15
I know a similar question may already have been asked on this forum, but I feel my requirement is peculiar. I have a data frame with a column containing the following values. Below is just a sample; the actual column contains more than 1000 observations.
Reported Terms
"2 Left Axillary Lymph Nodes Resection"
"cardyoohyper"
"Ablation Breast"
"Hypercarido"
"chordiohyper"
"Adenocarcinoma Of Colon (Radical Resection And Cr)"
"myocasta"
"hypermyopa"
I have another data frame with the rules below:
Data frame
I expect the following output:
"2 Left Axillary Lymph Nodes Resection"
"carddiohiper"
"Ablation Breast"
"hipercardio"
"cardiohyper"
"Adenocarcinoma Of Colon (Radical Resection And Cr)"
"miocasta"
"hipermiopa"
I am trying to hard-code this with the gsub function, but I understand that it will take a lot of time.
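pattern <- c("kardio", "carido", "cardyo", "cordio", "chordio")
replacement <- "cardio"
# gsub() takes a single pattern, so join the variants into one alternation
gsub(paste(pattern, collapse = "|"), replacement, df$reportedterms)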
With the above approach I have to hard-code every rule separately, creating new pattern and replacement variables for each call to gsub.
Is there a simple approach to solve this problem?
Upvotes: 0
Views: 166
Reputation: 12410
First let's set this up as described by you:
library(tibble)

df <- tibble(text = c("2 Left Axillary Lymph Nodes Resection",
                      "cardyoohyper",
                      "Ablation Breast",
                      "Hypercarido",
                      "chordiohyper",
                      "Adenocarcinoma Of Colon (Radical Resection And Cr)",
                      "myocasta",
                      "hypermyopa"))

replace_dict <- tibble(pattern = list(c("kardio", "carido", "cardyo", "cordio", "chordio"),
                                      "myoca",
                                      "myopa",
                                      "hyper"),
                       replacement = c("cardio",
                                       "mioca",
                                       "miopa",
                                       "hiper"))
I would simply use stringi for the task, as it has an extremely efficient version of gsub: stri_replace_all_fixed (you could also use the regex version, stri_replace_all_regex, which is a bit slower but works the same way). It can handle several patterns and replacements in one call, so all we need to do is unnest the pattern column first and then run stringi:
batch_replace <- function(text, replace_dict) {
  # One row per (pattern, replacement) pair
  replace_dict <- tidyr::unnest(replace_dict, pattern)
  stringi::stri_replace_all_fixed(str = text,
                                  pattern = replace_dict$pattern,
                                  replacement = replace_dict$replacement,
                                  vectorize_all = FALSE)
}
Let's put this function to the test:
df$text_new <- batch_replace(df$text, replace_dict)
df
#> # A tibble: 8 x 2
#> text text_new
#> <chr> <chr>
#> 1 2 Left Axillary Lymph Nodes Resecti~ 2 Left Axillary Lymph Nodes Resecti~
#> 2 cardyoohyper cardioohiper
#> 3 Ablation Breast Ablation Breast
#> 4 Hypercarido Hypercardio
#> 5 chordiohyper cardiohiper
#> 6 Adenocarcinoma Of Colon (Radical Re~ Adenocarcinoma Of Colon (Radical Re~
#> 7 myocasta miocasta
#> 8 hypermyopa hipermiopa
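To make the behaviour of vectorize_all = FALSE more concrete, here is a minimal base R sketch of what I understand it to do: every pattern is applied in turn to every string, so the result of one replacement is the input to the next. The helper name batch_replace_base is made up for illustration and is not part of stringi.

```r
# Hypothetical helper (not part of stringi): apply each fixed pattern in
# turn to the whole vector, feeding each result into the next replacement.
batch_replace_base <- function(text, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  for (i in seq_along(patterns)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    text <- gsub(patterns[i], replacements[i], text, fixed = TRUE)
  }
  text
}

patterns     <- c("kardio", "carido", "cardyo", "cordio", "chordio",
                  "myoca", "myopa", "hyper")
replacements <- c(rep("cardio", 5), "mioca", "miopa", "hiper")

terms  <- c("cardyoohyper", "Hypercarido", "chordiohyper",
            "myocasta", "hypermyopa")
result <- batch_replace_base(terms, patterns, replacements)
print(result)
```

Because the rules are applied one after another, their order can matter when one replacement produces text that a later pattern matches. Matching is also case-sensitive here, which is why "Hypercarido" keeps its leading "Hyper".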
I think that is what you wanted. Note that the function isn't very flexible, as you have to provide the inputs to stri_replace_all_fixed exactly in the way shown, i.e. replace_dict needs a pattern column and a replacement column. Since you haven't shared the file, I can't help you with wrangling your rules into that form, so you will have to figure that out or ask a new question.
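As a rough starting point for that wrangling, here is a sketch under a loud assumption: that the rules arrive as a flat table with one row per rule and the variants packed into a single comma-separated cell. The column name variants and the comma separator are guesses on my part, since the actual file is unknown.

```r
# Hypothetical rules table: the layout, the column name `variants` and the
# comma separator are assumptions, not taken from the real file.
rules <- data.frame(
  variants    = c("kardio, carido, cardyo, cordio, chordio",
                  "myoca", "myopa", "hyper"),
  replacement = c("cardio", "mioca", "miopa", "hiper"),
  stringsAsFactors = FALSE
)

# Split each cell into individual patterns and trim surrounding whitespace
pattern_list <- lapply(strsplit(rules$variants, ",", fixed = TRUE), trimws)

# Long format: one row per (pattern, replacement) pair
replace_long <- data.frame(
  pattern     = unlist(pattern_list),
  replacement = rep(rules$replacement, lengths(pattern_list)),
  stringsAsFactors = FALSE
)
print(replace_long)
```

The pattern and replacement columns of replace_long are already in the one-pair-per-row shape that unnest produces, so they could be handed to stri_replace_all_fixed directly. If you prefer the nested form used above, tibble(pattern = pattern_list, replacement = rules$replacement) should give an equivalent replace_dict.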
If you want replacement to be case insensitive and also want to lowercase the text, the function could look like this:
batch_replace <- function(text, replace_dict, to_lower = TRUE, case_insensitive = TRUE) {
  replace_dict <- tidyr::unnest(replace_dict, pattern)
  if (to_lower) {
    text <- tolower(text)
  }
  stringi::stri_replace_all_fixed(str = text,
                                  pattern = replace_dict$pattern,
                                  replacement = replace_dict$replacement,
                                  vectorize_all = FALSE,
                                  opts_fixed = stringi::stri_opts_fixed(case_insensitive = case_insensitive))
}
You can turn on/off lower casing and case-insensitive replacement as you need it.
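To show how the two switches differ, here is a small sketch that calls stringi directly on two of the sample terms with a cut-down rule set. The variable names are only for illustration.

```r
library(stringi)

terms        <- c("Hypercarido", "Ablation Breast")
patterns     <- c("carido", "hyper")
replacements <- c("cardio", "hiper")

# Case-insensitive matching only: "Hyper" is matched and replaced,
# while text that matches no pattern keeps its original casing.
keep_case <- stri_replace_all_fixed(
  terms, patterns, replacements,
  vectorize_all = FALSE,
  opts_fixed = stri_opts_fixed(case_insensitive = TRUE)
)

# Lower-casing first: everything ends up lower case, so plain
# case-sensitive matching is enough for lower-case patterns.
lowered <- stri_replace_all_fixed(
  tolower(terms), patterns, replacements,
  vectorize_all = FALSE
)

print(keep_case)
print(lowered)
```

Both variants turn "Hypercarido" into "hipercardio", because the replacement text itself is lower case. The difference shows up in untouched terms: "Ablation Breast" is left alone in the first call and becomes "ablation breast" in the second.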
Upvotes: 1