Corvus

Reputation: 8049

Is there an R function to escape regex characters in a string?

I want to build a regex by substituting in some strings to search for. These strings need to be escaped before I put them in the regex, so that the pattern still works when a search string contains regex metacharacters.

Some languages have functions that will do this for you (e.g. Python's re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?

For example (made up function):

x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"

Upvotes: 35

Views: 12654

Answers (6)

Friede

Reputation: 7524

Nowadays

> stringr::str_escape(x)
[1] "foo\\[bar\\]"

might be enough.

Upvotes: 0

antonio

Reputation: 11120

According to ?regex:

The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).

Therefore, using a capture group, (\\W), we can detect each non-word character and escape it with the \\1 backreference:

> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"

Or, equivalently, use "([^[:alnum:]_])" in place of "(\\W)".
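
As a rough sketch of how this substitution might be wrapped up and used when building a larger pattern (the helper name escape_nonword and the sample strings are made up for illustration; only base R is assumed, and matching is done with perl = TRUE):

```r
# Hypothetical helper: put a backslash in front of every non-word character.
escape_nonword <- function(x) {
  gsub("(\\W)", "\\\\\\1", x)
}

needle  <- "foo[bar]"
pattern <- paste0("^", escape_nonword(needle), "$")

# Escaped: the brackets are treated as literal characters.
escaped_hit <- grepl(pattern, "foo[bar]", perl = TRUE)

# Unescaped: [bar] is read as a character class, so the literal string does not match.
unescaped_hit <- grepl(paste0("^", needle, "$"), "foo[bar]", perl = TRUE)
```

Here escaped_hit should be TRUE and unescaped_hit FALSE, which is the difference the escaping is meant to make.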

Upvotes: 1

Ryan C. Thompson

Reputation: 42020

Use the rex package

These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:

library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")

But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:

x = "foo[bar]"
y = rex(start, x, end)

Now y is ^foo\[bar\]$ and will only match the exact string contained in x.

Upvotes: 2

Paul Lemmens

Reputation: 625

A simpler approach than @ryanthompson's function is to prepend \\Q and append \\E to your string. See the help file ?base::regex.
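
A minimal sketch of that idea, assuming the pattern is used with perl = TRUE (PCRE); the helper name quote_literal and the test strings are hypothetical:

```r
# Hypothetical helper: wrap a string in \Q...\E so the regex engine reads it literally.
quote_literal <- function(x) {
  paste0("\\Q", x, "\\E")
}

x <- "foo[bar]"

# The quoted pattern finds the literal text, brackets included.
literal_hit <- grepl(quote_literal(x), "see foo[bar] here", perl = TRUE)

# "foob" would match the unquoted pattern foo[bar], but not the quoted one.
class_hit <- grepl(quote_literal(x), "foob", perl = TRUE)
```

One caveat to keep in mind: a string that itself contains \E would end the quoted section early, so this sketch assumes the input does not.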

Upvotes: 7

Ryan C. Thompson

Reputation: 42020

I've written an R version of Perl's quotemeta function:

library(stringr)
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}

I always use the Perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" (non-Perl) regexps in R.

Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:

This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:

$pattern =~ s/(\W)/\\$1/g;

As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):

Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.

which reinforces my point that this solution is only guaranteed for PCRE.

Upvotes: 33

Dason

Reputation: 61943

Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':

gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
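
If you would rather not depend on Hmisc, the same substitution can be wrapped in a small base R function. The sketch below uses a made-up name (escape_regex) and made-up search terms, and joins several escaped terms into one alternation:

```r
# Hypothetical wrapper around the substitution quoted above (base R only).
escape_regex <- function(string) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
}

# gsub is vectorized, so a whole vector of search terms can be escaped at once
# and then combined into a single "match any of these" pattern.
terms   <- c("a.b", "c+d", "foo[bar]")
pattern <- paste(escape_regex(terms), collapse = "|")

hits <- grepl(pattern, c("a.b", "axb", "c+d", "ccd", "foo[bar]"))
```

With the escaping in place, only the strings that contain one of the terms literally ("a.b", "c+d" and "foo[bar]") should be flagged in hits; "axb" and "ccd" should not.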

My previous answer:

I'm not sure whether there is a built-in function, but you could write one to do what you want. The function below builds a vector of the characters to replace and a vector of their escaped replacements, then loops through them making each substitution in turn.

re.escape <- function(strings){
    vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)", 
              "\\{", "\\}", "\\^", "\\$","\\*", 
              "\\+", "\\?", "\\.", "\\|")
    replace.vals <- paste0("\\\\", vals)
    for(i in seq_along(vals)){
        strings <- gsub(vals[i], replace.vals[i], strings)
    }
    strings
}

Some output

> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"  

Upvotes: 21
