Fernando
Fernando

Reputation: 7905

R code to check if word matches pattern

I need to validate a string against a character vector pattern. My current code is:

trim <- function (x) gsub("^\\s+|\\s+$", "", x)

# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= than 2
my.pattern = c(letters, '!', '.', '?')

check.pattern = function(word, min.size = 2)
{
    word = trim(word)
    chars = strsplit(word, NULL)[[1]]
    all(chars %in% my.pattern) && (length(chars) >= min.size)
}

Example:

w.valid = 'special!'
w.invalid = 'test-me'

check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE

This is VERY SLOW i guess...is there a faster way to do this? Regex maybe? Thanks!

PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix, where the row names and column names are the allowed characters. Then i iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:

row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.

The bottleneck of my code was the vector allocation, i was 'appending' instead of pre-allocate the final vector, so the code was taking 30 minutes to execute, instead of 20 seconds!

Upvotes: 3

Views: 6189

Answers (2)

Ian T. Small
Ian T. Small

Reputation: 298

If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:

^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$

Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, then MAX and the comma can be omitted. Likewise, if there is neither maximum nor minimum everything within the {} including the braces themselves can be replaced with a * which signifies the preceding item will be matched zero or more times; this is equivalent to {0}.

This ensures that the regex only matches strings where every character after any leading and trailing whitespace is from the set of * a lower case letter * a bang (exclamation point) * a question mark

Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.

The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that this algorithm ENSURES a O(n) performance rate, as the split causes n strings to be generated. This means that even FAILING strings must do at least n actions to reject the string.

Hopefully this clarifies why you were having performance issues.

Upvotes: 1

Blue Magister
Blue Magister

Reputation: 13363

There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.

The blaring issue here is strsplit. Comparing the equality of things character-by-character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.

check.pattern.2 = function(word, min.size = 2)
{
    word = trim(word)
    grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello  ","nA!","  asdf.!"," d d "))
#[1] FALSE  TRUE FALSE  TRUE FALSE

Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:

check.pattern.3 = function(word, min.size = 2)
{
    word = trim(word)
    grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello  ","nA!","  asdf.!"," d d "))
#[1] FALSE  TRUE FALSE  TRUE FALSE

Finally, you can internalize the regex from trim:

check.pattern.4 = function(word, min.size = 2)
{
    grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello  ","nA!","  asdf.!"," d d "))
#[1] FALSE  TRUE FALSE  TRUE FALSE

Upvotes: 6

Related Questions