gagolews

Reputation: 13046

Replace a set of pattern matches with corresponding replacement strings in R

The str_replace (and preg_replace) functions in PHP replace all occurrences of the search string with the replacement string. What interests me most is that if the search and replace arguments are arrays (in R we call them vectors), then str_replace takes a value from each array (vector) in turn and uses each pair to search and replace on the subject.

In other words, does R (or some R package) have a function to perform the following:

string <- "The quick brown fox jumped over the lazy dog."
patterns     <- c("quick", "brown", "fox")
replacements <- c("slow",  "black", "bear")
xxx_replace_xxx(string, patterns, replacements)          ## ???
## [1] "The slow black bear jumped over the lazy dog."

So I am looking for something like chartr, but for search patterns and replacement strings with an arbitrary number of characters. This cannot be done with a single call to gsub(), because its replacement argument must be a single string (see ?gsub). My current implementation is:

xxx_replace_xxx <- function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
   string
}
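
One property of this loop that any faster replacement needs to either keep or knowingly drop: the pairs are applied one after another, so a later pattern can match text inserted by an earlier replacement. A minimal sketch, assuming base R only; `replace_sequentially` is a hypothetical name for the same loop, repeated so the snippet runs on its own:

```r
# Same loop as above under a hypothetical name, so this snippet is self-contained.
replace_sequentially <- function(string, patterns, replacements) {
  for (i in seq_along(patterns))
    string <- gsub(patterns[i], replacements[i], string, fixed = TRUE)
  string
}

# "a" becomes "b" first, then every "b" (original and inserted) becomes "c".
chained <- replace_sequentially("ab", c("a", "b"), c("b", "c"))

# Applying the same pairs in the opposite order gives a different result.
reversed <- replace_sequentially("ab", c("b", "a"), c("c", "b"))

print(chained)   # "cc"
print(reversed)  # "bc"
```

So the order of `patterns` matters whenever a replacement string contains another pattern.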

However, I am looking for something much faster when length(patterns) is large: I have a lot of data to process, and the loop above is too slow for it.

Example toy data for benchmarking:

string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
   "po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
   "sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))

Upvotes: 7

Views: 4699

Answers (3)

gagolews

Reputation: 13046

This can be done with stringi >= 0.3-1 by using one of the stri_replace_*_all functions with the vectorize_all argument set to FALSE:

library("stringi")
string <- "The quicker brown fox jumped over the lazy dog."
patterns     <- c("quick", "brown", "fox")
replacements <- c("slow",  "black", "bear")
stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)
## [1] "The slower black bear jumped over the lazy dog."
stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
## [1] "The quicker black bear jumped over the lazy dog."
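
To make the role of vectorize_all clearer, here is a small sketch comparing it with the default behaviour. It assumes the stringi package is installed; the shortened input string is made up for illustration.

```r
library("stringi")

string       <- "The quick brown fox"
patterns     <- c("quick", "brown", "fox")
replacements <- c("slow",  "black", "bear")

# Default (vectorize_all = TRUE): the arguments are recycled element-wise,
# so each of the three results has exactly one pattern replaced.
elementwise <- stri_replace_all_fixed(string, patterns, replacements)

# vectorize_all = FALSE: every pattern/replacement pair is applied to each string.
combined <- stri_replace_all_fixed(string, patterns, replacements,
                                   vectorize_all = FALSE)

print(elementwise)
print(combined)
```

With the default, the result has one element per pattern; with vectorize_all=FALSE it has one element per input string, which is what the question asks for.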

Some benchmarks:

string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
   "po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
   "sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))
library("gsubfn")  # provides gsubfn(); xxx_replace_xxx_pcre() is defined in Joshua Ulrich's answer
microbenchmark::microbenchmark(
   stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE),
   stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE),
   xxx_replace_xxx_pcre(string, "\\b" %s+% patterns %s+% "\\b", replacements),
   gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string),
   unit="relative",
   times=25
)
## Unit: relative
##                   expr       min        lq      mean    median        uq       max neval
## stri_replace_all_fixed  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    25 
## stri_replace_all_regex  2.169701  2.248115  2.198638  2.267935  2.267635  1.753289    25  
## xxx_replace_xxx_pcre    1.983135  1.967303  1.937021  1.961449  1.974422  1.469894    25  
## gsubfn                 63.067835 69.870657 69.815031 71.178841 72.503020 57.019072    25  

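So, among the approaches that match only at word boundaries, the PCRE-based loop (xxx_replace_xxx_pcre) is the fastest, slightly ahead of stri_replace_all_regex. Fixed matching with stri_replace_all_fixed is about twice as fast as either, but it also replaces matches inside longer words.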

Upvotes: 5

G. Grothendieck

Reputation: 269461

If the patterns are fixed strings made up of word characters, as in the example, then this works. gsubfn is like gsub, except that the replacement argument can be a string, list, function, or proto object. If it is a list, as here, each match of the regular expression is compared against the list's names, and matches that are found there are replaced with the corresponding values:

library(gsubfn)

gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string)
## [1] "The slow black bear jumped over the lazy dog."
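
The same lookup idea can be sketched in base R with gregexpr and regmatches, which may help show what the list replacement does. This is an illustrative approximation, not how gsubfn is implemented; `replace_words` is a hypothetical helper name.

```r
# Hypothetical helper: extract every word once, look it up in a named vector,
# and write the (possibly replaced) words back in a single pass.
replace_words <- function(string, patterns, replacements) {
  lookup <- setNames(replacements, patterns)
  m <- gregexpr("\\b\\w+\\b", string, perl = TRUE)
  words <- regmatches(string, m)
  regmatches(string, m) <- lapply(words, function(w) {
    hit <- w %in% names(lookup)          # which words have a replacement
    w[hit] <- unname(lookup[w[hit]])     # swap those, leave the rest alone
    w
  })
  string
}

string       <- "The quick brown fox jumped over the lazy dog."
patterns     <- c("quick", "brown", "fox")
replacements <- c("slow",  "black", "bear")
result <- replace_words(string, patterns, replacements)
print(result)

# Single pass: an inserted "b" is not replaced again by the second pair.
single_pass <- replace_words("a b", c("a", "b"), c("b", "c"))
print(single_pass)
```

Because each word is looked up only once, the result can differ from the sequential gsub loop when a replacement is itself one of the patterns.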

Upvotes: 8

Joshua Ulrich

Reputation: 176648

Using PCRE (perl=TRUE) instead of fixed matching takes about one third of the time on my machine for your example.

xxx_replace_xxx_pcre <- function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, perl=TRUE)
   string
}
system.time(x <- xxx_replace_xxx(string, patterns, replacements))
#    user  system elapsed 
#   0.491   0.000   0.491 
system.time(p <- xxx_replace_xxx_pcre(string, patterns, replacements))
#    user  system elapsed 
#   0.162   0.000   0.162 
identical(x,p)
# [1] TRUE
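
The two versions agree here because the example patterns contain only letters. With perl=TRUE the patterns are interpreted as regular expressions, so a pattern containing metacharacters such as "." would behave differently from fixed=TRUE. One possible workaround, sketched under the assumption that no pattern itself contains the sequence \E, is to wrap each pattern in PCRE's \Q...\E literal quoting; `quote_pcre` is a hypothetical helper name.

```r
# Hypothetical helper: make PCRE treat each pattern as a literal string.
# Assumes the patterns do not contain "\\E", which would end the quoting early.
quote_pcre <- function(patterns) paste0("\\Q", patterns, "\\E")

# Unquoted, "." matches any character, so "axb" is replaced as well.
unquoted <- gsub("a.b", "X", "a.b axb", perl = TRUE)

# Quoted, only the literal text "a.b" is replaced.
quoted <- gsub(quote_pcre("a.b"), "X", "a.b axb", perl = TRUE)

print(unquoted)  # "X X"
print(quoted)    # "X axb"
```

Backslashes in the replacement strings are also interpreted when fixed=TRUE is not used, so those may need escaping as well.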

Upvotes: 10
