R code to check if word matches pattern

Question

I need to validate a string against a character vector pattern. My current code is:

trim <- function (x) gsub("^\s+|\s+$", "", x)

# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= than 2
my.pattern = c(letters, '!', '.', '?')

check.pattern = function(word, min.size = 2)
{
    word = trim(word)
    chars = strsplit(word, NULL)[[1]]
    all(chars %in% my.pattern) && (length(chars) >= min.size)
}

Example:

w.valid = 'special!'
w.invalid = 'test-me'

check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE

This is VERY SLOW i guess...is there a faster way to do this? Regex maybe? Thanks!

PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix, where the row names and column names are the allowed characters. Then i iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:

row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.

The bottleneck of my code was the vector allocation, i was 'appending' instead of pre-allocate the final vector, so the code was taking 30 minutes to execute, instead of 20 seconds!

Blue Magister · Accepted Answer

There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.

The blaring issue here is strsplit. Comparing the equality of things character-by-character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.

check.pattern.2 = function(word, min.size = 2)
{
    word = trim(word)
    grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello  ","nA!","  asdf.!"," d d "))
#[1] FALSE  TRUE FALSE  TRUE FALSE

Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:

check.pattern.3 = function(word, min.size = 2)
{
    word = trim(word)
    grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello  ","nA!","  asdf.!"," d d "))
#[1] FALSE  TRUE FALSE  TRUE FALSE

Finally, you can internalize the regex from trim:

check.pattern.4 = function(word, min.size = 2)
{
    grepl(paste0("^\s*[a-z!.?]{",min.size,",}\s*$"),word)
}
check.pattern.4(c(" d ","!hello  ","nA!","  asdf.!"," d d "))
#[1] FALSE  TRUE FALSE  TRUE FALSE

R code to check if word matches pattern

Answers (2)

Related Questions