Joao Silva

Reputation: 876

C# regex to match words containing known substrings and not equal to specific keywords

I need to verify if a string contains "error" or "exception" in it, excluding certain keywords: "exception1", "exception2", "includeException", "error1".

This regex seems to do the job:

\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)(exception|error)\w*\b

It correctly returns 2 matches when run against the following string:

Test string: "exception1 exception2 exception3 includeException error1 error2"
Matches: "exception3", "error2"

However, if I set the RegexOptions.IgnoreCase flag or add "(?i)" at the beginning of the regex, it also returns a match for "includeException".

What am I missing here?

Upvotes: 1

Views: 98

Answers (3)

pabrams

Reputation: 1164

Regex is not very readable... how about a pure C# solution?

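using System.Linq;

// Extension methods must be declared in a static class.
public static class StringExtensions
{
    // Returns true when input contains "error" or "exception" and none of the
    // excluded keywords. The check covers the whole string (not word by word)
    // and is case-sensitive.
    public static bool ContainsErrorOrExceptionExcept(this string input, string[] excludedKeywords)
    {
        if (!input.Contains("error") && !input.Contains("exception"))
        {
            return false;
        }

        // Any excluded keyword anywhere in the input rejects it.
        return !excludedKeywords.Any(keyword => input.Contains(keyword));
    }
}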
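
If the goal is the per-word result from the question rather than a yes/no answer for the whole string, the same idea can be applied word by word. The following is only a sketch: FilterWords is a hypothetical helper, and it assumes that words are separated by spaces, that matching should ignore case, and that the code runs as C# 9 top-level statements on .NET 5 or later.

```csharp
using System;
using System.Linq;

// Hypothetical helper: keeps each space-separated word that contains "error"
// or "exception" and is not exactly equal to an excluded keyword (ignoring case).
static string[] FilterWords(string input, string[] excludedKeywords) =>
    input.Split(' ', StringSplitOptions.RemoveEmptyEntries)
         .Where(w => w.Contains("error", StringComparison.OrdinalIgnoreCase)
                  || w.Contains("exception", StringComparison.OrdinalIgnoreCase))
         .Where(w => !excludedKeywords.Any(
             k => string.Equals(k, w, StringComparison.OrdinalIgnoreCase)))
         .ToArray();

string[] excluded = { "exception1", "exception2", "includeException", "error1" };
string[] words = FilterWords(
    "exception1 exception2 exception3 includeException error1 error2", excluded);

Console.WriteLine(string.Join(", ", words)); // exception3, error2
```

Comparing whole words with string.Equals gives the "not equal to a keyword" behaviour; swapping it for a Contains check would give the stricter "does not contain a keyword" behaviour.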

Upvotes: 2

Wiktor Stribiżew

Reputation: 627100

I see two main problems with your regex:

  • It has several unanchored lookaheads (when unanchored, they usually do not help unless they are part of a tempered greedy token or a similarly complex pattern).
  • The leading \w* sits in front of the lookaheads and can consume any prefix of a word, so the lookaheads may be tested in the middle of the word, which cancels most of their effect.

The problem with case-insensitivity is described in Berin's answer: once case is ignored, (exception|error) matches the Exception substring inside includeException. So, a possible solution is to add a leading word boundary to the (exception|error) pattern, which restricts matches to words that start with exception or error:

\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)\b(exception|error)\w*\b
                                                               ^^

However, if you need to match words containing error or exception, that ARE NOT EQUAL to specific keywords, use

\b(?!(?:exception1|exception2|includeException|error1)\b)\w*(exception|error)\w*\b

Here, the lookahead is anchored to the leading word boundary, so it is checked only once after each word boundary, not at each position inside a word. You can also contract it further: \b(?!(?:exception[12]|includeException|error1)\b)\w*(exception|error)\w*\b.

Now, if you need to match words containing error or exception, that DO NOT CONTAIN specific keywords, use

\b(?!\w*(?:exception1|exception2|includeException|error1))\w*(exception|error)\w*\b

All regex patterns used here were tested at regexhero.net.
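
The difference between the "not equal" and "do not contain" patterns can be checked with a short program. This is a sketch rather than a definitive test: MatchAll is a hypothetical helper, the word myerror1x is a made-up addition to the question's sample string, and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every match of pattern in input, ignoring case.
static string[] MatchAll(string pattern, string input) =>
    Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

// Words containing error/exception that are not EQUAL to a keyword.
const string NotEqual =
    @"\b(?!(?:exception1|exception2|includeException|error1)\b)\w*(exception|error)\w*\b";

// Words containing error/exception that do not CONTAIN a keyword.
const string NotContaining =
    @"\b(?!\w*(?:exception1|exception2|includeException|error1))\w*(exception|error)\w*\b";

// The last word is an assumption added for illustration: it contains "error1"
// without being equal to it.
const string Input =
    "exception1 exception2 exception3 includeException error1 error2 myerror1x";

string[] notEqual = MatchAll(NotEqual, Input);
string[] notContaining = MatchAll(NotContaining, Input);

Console.WriteLine(string.Join(", ", notEqual));      // exception3, error2, myerror1x
Console.WriteLine(string.Join(", ", notContaining)); // exception3, error2
```

Both patterns reject includeException even with RegexOptions.IgnoreCase, because the lookahead is evaluated at the start of the word, before \w* can skip past the "include" prefix.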

Upvotes: 2

Berin Loritsch

Reputation: 11463

Using a good Regex tester can help you figure out what's actually being matched. I used this one:

http://regexhero.net/tester/

In the results where it highlights the matches, there is a small button with an 'i' for information. The reason it matches includeException when the regex is case-insensitive is that you are matching the latter half of the word: the leading \w* consumes "include", and the lookaheads are then tested against "Exception", which is none of the excluded keywords. The regex doesn't require white space separating the words.

Your regex would also match with case-insensitivity turned off if includeException were written as includeexception, because your positive match (exception|error) matches the last half. You can also see this when you start removing spaces: exception1exception2 doesn't match, but exception1exception2exception3 does.
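
The behaviour described above can be reproduced with a few lines of C#. This is a sketch only: MatchAll is a hypothetical helper and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The regex from the question, unchanged.
const string Original =
    @"\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)(exception|error)\w*\b";
const string Sample = "exception1 exception2 exception3 includeException error1 error2";

// Hypothetical helper: returns every match of pattern in input.
static string[] MatchAll(string pattern, string input, RegexOptions options) =>
    Regex.Matches(input, pattern, options)
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

string[] caseSensitive = MatchAll(Original, Sample, RegexOptions.None);
string[] ignoreCase = MatchAll(Original, Sample, RegexOptions.IgnoreCase);

Console.WriteLine(string.Join(", ", caseSensitive)); // exception3, error2
Console.WriteLine(string.Join(", ", ignoreCase));    // exception3, includeException, error2

// With the spaces removed, \w* can skip over the excluded prefixes.
bool twoJoined = Regex.IsMatch("exception1exception2", Original);
bool threeJoined = Regex.IsMatch("exception1exception2exception3", Original);

Console.WriteLine(twoJoined);   // False
Console.WriteLine(threeJoined); // True
```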

While Regex is very compact, there are several ways to get it wrong. A straightforward approach might be a better solution in this case.

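Removing the final * from your regex makes it behave the way you want on your sample string, because the trailing \w then requires exactly one more word character after exception or error. Note that a bare exception or error no longer matches, and a word such as includeException2 still would: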

\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)(exception|error)\w\b

Upvotes: 3
