Sebastian Zeki

Reputation: 6874

How to pattern match for a list of strings

I am trying to find and replace some text based on fuzzy matching as follows.

Aim

I want to do this for a list of find-and-replace pairs, but I don't know how to extend the current function to allow this.

Input

Input text
 df <- data.frame(textcol=c("In this substring would like to find the radiofrequency ablation of this HALO",
                             "I like to do endoscopic submuocsal resection and also radifrequency ablation",
                             "No match here","No mention of this radifreq7uency ablati0on thing"))

The attempt

# Lower-case the text
df$textcol <- tolower(df$textcol)

findAndReplace <- function(matchPattern, rawText, replace) {
  # Locate the approximate match of the pattern in each string
  positions <- aregexec(matchPattern, rawText, max.distance = 0.1)
  res <- regmatches(rawText, positions)
  res[lengths(res) == 0] <- "XXXX"  # deal with 0 length matches somehow

  # Term mapping: swap the matched text for the replacement
  unname(Vectorize(gsub)(unlist(res), replace, rawText))
}

# Define the pattern to match and what to replace it with
matchPatternRFA <- "radiofrequency ablation"
repRF <- findAndReplace(matchPatternRFA, df$textcol, "RFA")
repRF

The problem

The above works fine for the replacement of one term, but what if I also want to replace endoscopic 'submucosal resection' with 'EMR' and 'HALO' with 'catheter'?

Ideally I'd like to create a list of terms to match but then how do I also specify how to replace them?

Upvotes: 1

Views: 93

Answers (2)

G. Grothendieck

Reputation: 269471

Define asub to replace approximate matches with a replacement string and define a matching list L that for each name defines its replacement. Then run Reduce to perform the replacements.

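asub <- function(pattern, replacement, x, fixed = FALSE, ...) {
  # Locate the approximate match in each element of x;
  # ... passes extra arguments (e.g. max.distance) on to aregexec
  m <- aregexec(pattern, x, fixed = fixed, ...)
  r <- regmatches(x, m)
  lens <- lengths(r)  # 0 where nothing matched
  if (all(lens == 0)) return(x)
  # Swap the matched text for the replacement, only where a match was found
  replace(x, lens > 0, mapply(sub, r[lens > 0], replacement, x[lens > 0]))
}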

L <- list("radiofrequency ablation" = "RFA",
      "endoscopic submucosal resection" = "EMR",
      "HALO" = "catheter")

Reduce(function(x, nm) asub(nm, L[[nm]], x), init = df$textcol, names(L))

giving:

[1] "In this substring would like to find the RFA of this catheter"
[2] "I like to do EMR and also RFA"
[3] "No match here"
[4] "No mention of this RFA thing"
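
For readers less used to Reduce, the same fold can be written as a plain loop where each pass feeds its result into the next. This is only a sketch under a few assumptions: fuzzy_sub is a hypothetical helper (not the asub above), the lookup is a named character vector rather than a list, and max.distance is exposed so it can be tuned.

```r
# Hypothetical helper (not from the answer): replace the approximate match of
# `pattern` in each element of `x` with `replacement`.
fuzzy_sub <- function(pattern, replacement, x, max.distance = 0.1) {
  m <- aregexec(pattern, x, max.distance = max.distance)
  found <- regmatches(x, m)  # character(0) where nothing matched
  for (i in which(lengths(found) > 0)) {
    # fixed = TRUE treats the matched text literally, not as a regex
    x[i] <- sub(found[[i]][1], replacement, x[i], fixed = TRUE)
  }
  x
}

# Names are the terms to find, values are what to put in their place
lookup <- c("radiofrequency ablation" = "RFA",
            "endoscopic submucosal resection" = "EMR",
            "HALO" = "catheter")

text <- c("In this substring would like to find the radiofrequency ablation of this HALO",
          "No match here")

# Same idea as Reduce(): apply one replacement after another
out <- text
for (term in names(lookup)) {
  out <- fuzzy_sub(term, lookup[[term]], out)
}
out
```

Note that the matching here is case sensitive, so if the text has been lower-cased first (as in the question), "HALO" would need to be written as "halo" in the lookup.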

Upvotes: 1

tporte

Reputation: 1

You can create a lookup table with patterns and necessary replacements:

library(data.table)

dt <-
  data.table(
    textcol = c(
      "In this substring would like to find the radiofrequency ablation of this HALO",
      "I like to do endoscopic submuocsal resection and also radifrequency ablation",
      "No match here",
      "No mention of this radifreq7uency ablati0on thing"
    )
  )

dt_gsub <- data.table(
  textcol = c("submucosal resection",
              "HALO",
              "radiofrequency ablation"),
  textcol2 = c("EMR", "catheter", "RFA")
)

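# Apply each pattern/replacement pair to the whole column in turn.
# gsub() matches exactly, so misspelled terms are left unchanged.
for (k in seq_len(nrow(dt_gsub)))
  dt[, textcol := gsub(dt_gsub$textcol[k], dt_gsub$textcol2[k], textcol)]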
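
One thing worth checking with a lookup table like this is what happens to the misspelled rows. A minimal base R sketch, assuming a named character vector in place of the data.table (the names lookup and x are just for illustration), suggests that plain gsub only catches exact spellings:

```r
# Names are the exact strings to find, values are the replacements
lookup <- c("submucosal resection" = "EMR",
            "HALO" = "catheter",
            "radiofrequency ablation" = "RFA")

x <- c("find the radiofrequency ablation of this HALO",
       "endoscopic submuocsal resection and also radifrequency ablation")

# Exact (non-fuzzy) replacement, one term at a time
for (term in names(lookup)) {
  x <- gsub(term, lookup[[term]], x, fixed = TRUE)
}
x
# [1] "find the RFA of this catheter"
# [2] "endoscopic submuocsal resection and also radifrequency ablation"
```

The second string is untouched because its terms are misspelled, so for the fuzzy matching asked about in the question an approximate matcher such as aregexec would still be needed.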

Upvotes: 0
