Reputation: 6874
I am trying to find and replace some text based on fuzzy matching as follows.
Aim
I want to do this for a list of find-and-replace pairs, but I don't know how to extend the current function to allow this.
Input
Input text
df <- data.frame(textcol=c("In this substring would like to find the radiofrequency ablation of this HALO",
"I like to do endoscopic submuocsal resection and also radifrequency ablation",
"No match here","No mention of this radifreq7uency ablati0on thing"))
The attempt
##### Lower case the text ##########
df$textcol<-tolower(df$textcol)
#Need to define the pattern to match and what to replace it with
matchPattern <- "radiofrequency ablation"
findAndReplace <- function(matchPattern, rawText, replace)
{
  # Locate approximate matches of the pattern in each element of rawText
  positions <- aregexec(matchPattern, rawText, max.distance = 0.1)
  res <- regmatches(rawText, positions)
  res[lengths(res)==0] <- "XXXX" # deal with 0 length matches somehow
  #################### Term mapping ####################
  # Replace each matched substring with the replacement text
  unname(Vectorize(gsub)(unlist(res), replace, rawText))
}
matchPatternRFA <- c("radiofrequency ablation")
repRF <- findAndReplace(matchPatternRFA, df$textcol, "RFA")
repRF
The problem
The above works fine for replacing one term, but what if I also want to replace 'endoscopic submucosal resection' with 'EMR' and 'HALO' with 'catheter'?
Ideally I'd like to create a list of terms to match, but then how do I also specify what to replace them with?
Upvotes: 1
Views: 93
Reputation: 269471
Define asub to replace approximate matches with a replacement string, and define a matching list L that gives, for each name, its replacement. Then run Reduce to perform the replacements.
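asub <- function(pattern, replacement, x, fixed = FALSE, ...) {
  # Locate approximate matches; extra arguments such as max.distance go to aregexec
  m <- aregexec(pattern, x, fixed = fixed, ...)
  r <- regmatches(x, m)
  lens <- lengths(r)
  if (all(lens == 0)) return(x)
  # Substitute the matched text with the replacement in the elements that matched
  replace(x, lens > 0, mapply(sub, r[lens > 0], replacement, x[lens > 0]))
}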
L <- list("radiofrequency ablation" = "RFA",
"endoscopic submucosal resection" = "EMR",
"HALO" = "catheter")
Reduce(function(x, nm) asub(nm, L[[nm]], x), init = df$textcol, names(L))
giving:
[1] "In this substring would like to find the RFA of this catheter"
[2] "I like to do EMR and also RFA"
[3] "No match here"
[4] "No mention of this RFA thing"
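Note that asub hands the matched text back to sub as a regular expression, which could misfire if a matched substring ever contained regex metacharacters such as parentheses. A minimal sketch of one way around that, assuming a hypothetical helper asub_pos (not part of the answer above) that splices the replacement in by position:

```r
# Hypothetical helper (not from the answer): replace the first approximate
# match in each string by position rather than by re-matching its text.
asub_pos <- function(pattern, replacement, x, max.distance = 0.1) {
  m <- aregexec(pattern, x, max.distance = max.distance)
  # Start of the whole match per string; -1 means no match
  start <- vapply(m, function(mi) as.integer(mi)[1L], integer(1))
  hit <- !is.na(start) & start > 0L
  if (!any(hit)) return(x)
  # Length of the whole match, only for the strings that matched
  len <- vapply(m[hit],
                function(mi) as.integer(attr(mi, "match.length"))[1L],
                integer(1))
  s <- start[hit]
  # Text before the match + replacement + text after the match
  x[hit] <- paste0(substr(x[hit], 1L, s - 1L),
                   replacement,
                   substr(x[hit], s + len, nchar(x[hit])))
  x
}

# Assumed mapping and sample text, for illustration only
L <- list("radiofrequency ablation" = "RFA", "HALO" = "catheter")
texts <- c("radiofrequency ablation of this HALO", "No match here")

# Fold over the names of L, feeding each result into the next replacement
result <- Reduce(function(acc, nm) asub_pos(nm, L[[nm]], acc), names(L), texts)
print(result)
```

Like asub, this only replaces the first approximate match in each string; whether that is enough depends on the data.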
Upvotes: 1
Reputation: 1
You can create a lookup table with patterns and necessary replacements:
library(data.table)
dt <-
data.table(
textcol = c(
"In this substring would like to find the radiofrequency ablation of this HALO",
"I like to do endoscopic submuocsal resection and also radifrequency ablation",
"No match here",
"No mention of this radifreq7uency ablati0on thing"
)
)
dt_gsub <- data.table(
textcol = c("submucosal resection",
"HALO",
"radiofrequency ablation"),
textcol2 = c("EMR", "catheter", "RFA")
)
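# gsub matches exactly, so misspelled variants are left unchanged.
# It is vectorised over the column, so only the lookup rows need a loop.
for (j in seq_len(nrow(dt_gsub)))
  dt[, textcol := gsub(dt_gsub$textcol[j], dt_gsub$textcol2[j], textcol)]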
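Because gsub only matches exactly, it may help to check which rows contain a near-miss that the lookup loop would leave untouched. A rough sketch in base R, assuming illustrative names (lookup, texts, needs_fuzzy) that do not appear in the answer, using agrepl for the approximate test:

```r
# Illustrative lookup table and sample text (names are assumptions)
lookup <- data.frame(
  pattern = c("submucosal resection", "HALO", "radiofrequency ablation"),
  replacement = c("EMR", "catheter", "RFA"),
  stringsAsFactors = FALSE
)
texts <- c("radiofrequency ablation of this HALO",
           "No match here",
           "also radifrequency ablation")

# One column per pattern: TRUE where the text contains an approximate match
fuzzy_hits <- sapply(lookup$pattern,
                     function(p) agrepl(p, texts, max.distance = 0.1))

# Same layout, but for exact (literal) matches
exact_hits <- sapply(lookup$pattern,
                     function(p) grepl(p, texts, fixed = TRUE))

# Rows that an exact gsub would miss but a fuzzy replacement could catch
needs_fuzzy <- fuzzy_hits & !exact_hits
print(needs_fuzzy)
```

Rows flagged in needs_fuzzy are candidates for an approximate replacement (for example via aregexec) rather than plain gsub.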
Upvotes: 0