Reputation: 5033

Using variable to create regular expression pattern in R

I have a function:

ncount <- function(num = NULL) {

 toRead <- readLines("abc.txt")
 n <- as.character(num)
 x <- grep("{"n"} number",toRead,value=TRUE)

}

When calling grep, I want the num passed to the function to be used to build the search pattern dynamically. How can this be done in R? Every line of the text file contains a number and some text.

Upvotes: 1

Answers (2)

Wiktor Stribiżew

Reputation: 626747

To build a regular expression from variables in R, in the current scenario, you may simply concatenate string literals with your variable using paste0:

grep(paste0('\\{', n, '} number'), toRead, value=TRUE)

Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
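A minimal sketch of how the function from the question might look with this approach. The lines argument (used in place of the readLines("abc.txt") call so the example is self-contained) and the sample data are assumptions made for illustration only.

```r
# Sketch: build the pattern from the number passed in.
# `lines` and `sample_lines` are hypothetical stand-ins for readLines("abc.txt").
ncount <- function(num, lines) {
  n <- as.character(num)
  # '\\{' matches a literal '{'
  pattern <- paste0('\\{', n, '} number')
  grep(pattern, lines, value = TRUE)
}

sample_lines <- c('{5} number of items', '{12} number of items', 'no match here')
ncount(5, sample_lines)
## => [1] "{5} number of items"
```

Passing the lines in as an argument is only a convenience for testing; reading the file inside the function, as in the question, works the same way.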

If you need to match any one of several items (an alternation), you may use a combination of paste/paste0:

words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')

The resulting regex, Ben likes (bananas|mangoes|plums)\., will match "Ben likes bananas.", "Ben likes mangoes." or "Ben likes plums.".
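As a quick check, the alternation pattern can be applied with grepl. The sentences vector is made up for this illustration.

```r
words <- c('bananas', 'mangoes', 'plums')
# Inner paste joins the words with '|'; outer paste0 adds the fixed text
regex <- paste0('Ben likes (', paste(words, collapse = '|'), ')\\.')

# Hypothetical test sentences
sentences <- c('Ben likes bananas.', 'Ben likes plums.', 'Ben likes apples.')
grepl(regex, sentences)
## => [1]  TRUE  TRUE FALSE
```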

NOTE: PCRE (used when you pass perl=TRUE to base R regex functions) and ICU (used by stringr/stringi regex functions) have proved to handle these scenarios better, so it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
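For illustration, switching a base R call to PCRE only takes the perl = TRUE argument. The input vector below is an assumption, not data from the question.

```r
x <- c('plums', 'plumsauce')   # hypothetical input

# Same base R function, PCRE engine selected with perl = TRUE
whole_word <- grepl('\\bplums\\b', x, perl = TRUE)
whole_word
## => [1]  TRUE FALSE

# Lookarounds such as (?!\w) are PCRE syntax, so they need perl = TRUE
no_word_after <- grepl('plums(?!\\w)', x, perl = TRUE)
no_word_after
## => [1]  TRUE FALSE
```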

Oftentimes, you will want to build a pattern from a list of words that should be matched exactly, as whole words. Here, a lot depends on the type of boundaries you need, on whether the words can contain special regex metacharacters, and on whether they can contain whitespace.

In the most general case, word boundaries (\b) work well.

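# hypothetical input; the original demo data is not shown here
examples <- 'bananas, mangoes and plums, but not a banana'
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"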

The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana.
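To see what the boundaries add, the same alternation can be run with and without \b. The input string and the loose/strict names are hypothetical, chosen for this sketch.

```r
words <- c('bananas', 'mangoes', 'plums')
input <- 'bananasplit, bananas'   # hypothetical input
alternation <- paste(words, collapse = '|')

loose  <- paste0('(', alternation, ')')         # no boundaries
strict <- paste0('\\b(', alternation, ')\\b')   # whole words only

# Without boundaries, "bananas" is also found inside "bananasplit"
n_loose <- lengths(regmatches(input, gregexpr(loose, input, perl = TRUE)))
n_loose
## => [1] 2

# With boundaries, only the standalone word matches
n_strict <- lengths(regmatches(input, gregexpr(strict, input, perl = TRUE)))
n_strict
## => [1] 1
```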

If your list is like

words <- c('cm+km', 'uname\\vname')

you will have to escape the words first, i.e. prepend a \ to each metacharacter:

regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
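A short sketch of why the escaping step matters, using the same regex.escape helper. The two test strings are assumptions picked to show the difference.

```r
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
word <- 'cm+km'
candidates <- c('cm+km', 'cmmkm')   # hypothetical test strings

# Unescaped, '+' is a quantifier: the pattern means "c, one or more m, k, m"
raw_hits <- grepl(word, candidates, perl = TRUE)
raw_hits
## => [1] FALSE  TRUE

# Escaped, '+' is matched literally
escaped_hits <- grepl(regex.escape(word), candidates, perl = TRUE)
escaped_hits
## => [1]  TRUE FALSE

regex.escape(word)
## => [1] "cm\\+km"
```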

If your words can start or end with a special regex metacharacter, \b word boundaries won't work, because next to a non-word character \b requires a word character on the other side. Use one of the following instead:

Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
Custom boundaries that you build yourself from a lookbehind/lookahead combination and your own character class / bracket expression, or even more sophisticated patterns.

Example of the first two approaches in R (each match is replaced with itself enclosed in << and >>):

regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."
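The third approach, custom boundaries, could be sketched along the same lines. The choice of [\w/] as the boundary class is purely an assumption for illustration: it treats a slash like a word character, so C++/CLI is no longer matched.

```r
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')

# Hypothetical custom boundaries: no word char and no '/' on either side
regex <- paste0('(?<![\\w/])(', paste(regex.escape(words), collapse = '|'), ')(?![\\w/])')
result <- gsub(regex, "<<\\1>>", examples, perl = TRUE)
result
## => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and C++/CLI."
```

Any characters can go into the bracket expression, which makes this the most flexible of the three options.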

Upvotes: 0

Sven Hohenstein

Reputation: 81683

You could use paste to concatenate strings:

grep(paste("\\{", n, "} number", sep = ""), toRead, value = TRUE)
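As a small check, paste with sep = "" builds the same string as paste0. The value of n is made up, and the brace is escaped here so that it is matched literally.

```r
n <- '5'   # hypothetical value

# paste() separates its arguments with a space by default, hence sep = ""
pattern_paste  <- paste("\\{", n, "} number", sep = "")
pattern_paste0 <- paste0("\\{", n, "} number")

identical(pattern_paste, pattern_paste0)
## => [1] TRUE

# Without sep = "", unwanted spaces end up in the pattern
paste("\\{", n, "} number")
## => [1] "\\{ 5 } number"
```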

Upvotes: 5
