David Z

Reputation: 7041

How to substitute multiple words with spaces in R?

Here is an example:

drugs<-c("Lapatinib-Ditosylate", "Caffeic-Acid-Phenethyl-Ester", "Pazopanib-HCl", "D-Pantethine")

ads<-"These are recently new released drugs Lapatinib Ditosylate, Pazopanib HCl, and Caffeic Acid Phenethyl Ester"

I want to replace the space-separated drug names in ads with the hyphenated names from drugs, so that the output is:

"These are recently new released drugs Lapatinib-Ditosylate, Pazopanib-HCl, and Caffeic-Acid-Phenethyl-Ester"

Upvotes: 0

Views: 59

Answers (2)

IceCreamToucan

Reputation: 28685

If you create a vector of the strings to be replaced, you can loop over it alongside the vector of replacements (drugs), replacing all instances of one element in each iteration of the loop.

to_repl <- gsub('-', ' ', drugs)

for(i in seq_along(drugs))
  ads <- gsub(to_repl[i], drugs[i], ads)

ads
# "These are recently new released drugs Lapatinib-Ditosylate, Pazopanib-HCl, and Caffeic-Acid-Phenethyl-Ester"
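One caveat with the loop above: gsub() treats its pattern as a regular expression by default, so a name containing characters such as "." or "(" would not be matched literally. A minimal sketch of the same loop with fixed = TRUE, which makes gsub() match the search strings literally. The variable name fixed_ads is only for illustration and is not part of the original answer:

```r
drugs <- c("Lapatinib-Ditosylate", "Caffeic-Acid-Phenethyl-Ester",
           "Pazopanib-HCl", "D-Pantethine")
ads <- "These are recently new released drugs Lapatinib Ditosylate, Pazopanib HCl, and Caffeic Acid Phenethyl Ester"

# Space-separated spellings to search for
to_repl <- gsub("-", " ", drugs, fixed = TRUE)

# fixed_ads is an illustrative name; the original answer overwrites ads
fixed_ads <- ads
for (i in seq_along(drugs)) {
  # fixed = TRUE matches to_repl[i] literally instead of as a regex
  fixed_ads <- gsub(to_repl[i], drugs[i], fixed_ads, fixed = TRUE)
}

fixed_ads
```

For the drug names in the question the result is the same as without fixed = TRUE, since none of them contain regex metacharacters.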

Contrary to popular belief, for loops in R are no slower than lapply():

f_lapply <- function(ads){
  to_repl <- gsub('-', ' ', drugs)
  invisible(lapply(seq_along(to_repl), function(i) {
    ads <<- gsub(to_repl[i], drugs[i], ads)
  }))
  ads
}
f_loop <- function(ads){
  to_repl <- gsub('-', ' ', drugs)
  for(i in seq_along(to_repl))
    ads <- gsub(to_repl[i], drugs[i], ads)
  ads
}

f_loop(ads) == f_lapply(ads)
# [1] TRUE

microbenchmark::microbenchmark(f_loop(ads), f_lapply(ads), times = 1e4)
# Unit: microseconds
#           expr    min      lq     mean  median      uq       max neval
#    f_loop(ads) 59.488  95.180 118.0793 107.487 120.205  7426.866 10000
#  f_lapply(ads) 69.333 114.462 147.9732 130.872 152.205 27283.670 10000

Or, using more general examples:

loop_over <- 1:1e5
microbenchmark::microbenchmark(
  for_loop = {for(i in loop_over) 1},
  lapply   = {lapply(loop_over, function(x) 1)}
  )
# Unit: milliseconds
#      expr      min         lq       mean     median         uq       max neval
#  for_loop  4.66174   5.865842   7.725975   6.354867   7.449429  35.26807   100
#    lapply 94.09223 114.378778 125.149863 124.665128 134.217326 170.16889   100

loop_over <- 1:1e5
microbenchmark::microbenchmark(
  for_loop = {y <- numeric(1e5); for(i in seq_along(loop_over)) y[i] <- loop_over[i]},
  lapply   = {lapply(loop_over, function(x) x)}
  )
# Unit: milliseconds
#      expr      min       lq     mean   median       uq     max neval
#  for_loop 11.00184 11.49455 15.24015 12.10461 15.26050 134.139   100
#    lapply 71.41820 81.14660 93.64569 87.05162 98.59295 357.219   100

Upvotes: 2

nsinghphd

Reputation: 2022

This can also be done with lapply(), which was faster than the for loop in my benchmark below. Modifying @IceCreamToucan's answer, it can be written with lapply() as follows:

to_repl <- gsub('-', ' ', drugs)

invisible(lapply(seq_along(to_repl), function(i) {
  ads <<- gsub(to_repl[i], drugs[i], ads)
}))

ads
# [1] "These are recently new released drugs Lapatinib-Ditosylate, Pazopanib-HCl, and Caffeic-Acid-Phenethyl-Ester"

Microbenchmark

Unit: microseconds
     expr      min        lq      mean   median        uq      max neval
   lapply   80.514   87.4935  110.1103   93.304   96.1995 1902.861   100
 for.loop 2285.164 2318.5665 2463.1554 2338.216 2377.4120 7510.763   100
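The lapply() version relies on <<- to update ads from inside the anonymous function. As a possible alternative that avoids modifying a variable outside the function, here is a sketch using base R's Reduce(), which passes the text through one substitution per drug name. The name corrected is an assumption for illustration, and no timing claim is made for this variant:

```r
drugs <- c("Lapatinib-Ditosylate", "Caffeic-Acid-Phenethyl-Ester",
           "Pazopanib-HCl", "D-Pantethine")
ads <- "These are recently new released drugs Lapatinib Ditosylate, Pazopanib HCl, and Caffeic Acid Phenethyl Ester"

# Space-separated spellings to search for
to_repl <- gsub("-", " ", drugs, fixed = TRUE)

# Reduce() starts from init (the original text) and applies one
# replacement per index, feeding each result into the next call.
corrected <- Reduce(
  function(txt, i) gsub(to_repl[i], drugs[i], txt, fixed = TRUE),
  seq_along(drugs),
  init = ads
)

corrected
```

Here ads itself is left unchanged and the corrected string is returned as a value, which makes the step easier to wrap in a function or apply to several texts.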

Upvotes: 0
