A Gilani
A Gilani

Reputation: 443

Java Regex hung on a long string

I am trying to write a REGEX to validate a string. It should validate to the requirement which is that it should have only Uppercase and lowercase English letters (a to z, A to Z) (ASCII: 65 to 90, 97 to 122) AND/OR Digits 0 to 9 (ASCII: 48 to 57) AND Characters - _ ~ (ASCII: 45, 95, 126). Provided that they are not the first or last character. It can also have Character. (dot, period, full stop) (ASCII: 46) Provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively. I have tried using the following

Pattern.compile("^[^\\W_*]+((\\.?[\\w\\~-]+)*\\.?[^\\W_*])*$");

It works fine for smaller strings but it doesn't for long strings as i am experiencing thread hung issues and huge spikes in cpu. Please help.

Test cases for invalid strings:

"aB78."
"aB78..ab"
"aB78,1"
"aB78 abc"
".Abc12"

Test cases for valid strings:

"abc-def"
"a1b2c~3"
"012_345"

Upvotes: 2

Views: 1955

Answers (4)

nhahtdh
nhahtdh

Reputation: 56809

Problem

It is due to catastrophic backtracking. Let me show where it happens, by simplifying the regex to a regex which matches a subset of the original regex:

^[^\W_*]+((\.?[\w\~-]+)*\.?[^\W_*])*$

Since [^\W_*] and [\w\~-] can match [a-z], let us replace them with [a-z]:

^[a-z]+((\.?[a-z]+)*\.?[a-z])*$

Since \.? are optional, let us remove them:

^[a-z]+(([a-z]+)*[a-z])*$

You can see ([a-z]+)*, which is the classical example of regex which causes catastrophic backtracking (A*)*, and the fact that the outermost repetition (([a-z]+)*[a-z])* can expand to ([a-z]+)*[a-z]([a-z]+)*[a-z]([a-z]+)*[a-z] further exacerbate the problem (imagine the number of permutation to split the input string to match all expansions that your regex can have). And this is not mentioning [a-z]+ in front, which adds insult to injury, since it is of the form A*A*.

Solution

You can use this regex to validate the string according to your conditions:

^(?=[a-zA-Z0-9])[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+(?<=[a-zA-Z0-9])$

As Java string literal:

"^(?=[a-zA-Z0-9])[a-zA-Z0-9_~-]++(\\.[a-zA-Z0-9_~-]++)*+(?<=[a-zA-Z0-9])$"

Breakdown of the regex:

^                                      # Assert beginning of the string
(?=[a-zA-Z0-9])                        # Must start with alphanumeric, no special
[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+
(?<=[a-zA-Z0-9])                       # Must end with alphanumeric, no special
$                                      # Assert end of the string

Since . can't appear consecutively, and can't start or end the string, we can consider it a separator between strings of [a-zA-Z0-9_~-]+. So we can write:

[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+

All quantifiers are made possessive to reduce stack usage in Oracle's implementation and make the matching faster. Note that it is not appropriate to use them everywhere. Due to the way my regex is written, there is only one way to match a particular string to begin with, even without possessive quantifier.

Shorthand

Since this is Java and in default mode, you can shorten a-zA-Z0-9_ to \w and [a-zA-Z0-9] to [^\W_] (though the second one is a bit hard for other programmer to read):

^(?=[^\W_])[\w~-]++(\.[\w~-]++)*+(?<=[^\W_])$

As Java string literal:

"^(?=[^\\W_])[\\w~-]++(\\.[\\w~-]++)*+(?<=[^\\W_])$"

If you use the regex with String.matches(), the anchors ^ and $ can be removed.

Upvotes: 2

laune
laune

Reputation: 31290

A solution should try and find negatives rather than try and match a pattern over the entire string.

Pattern bad = Pattern.compile( "[^-\\W.~]|\\.\\.|^\\.|\\.$" );
for( String str: new String[]{ "aB78.", "aB78..ab", "abcdef",
        "aB78,1", "aB78 abc" } ){
    Matcher mat = bad.matcher( str );
    System.out.println( mat.find() );
}

(It is remarkable to see how the initial statement "string...should have only" leads programmers to try and create positive assertions by parsing or matching valid characters over the full length rather than the much simpler search for negatives.)

Upvotes: 1

Bohemian
Bohemian

Reputation: 425003

Your regex suffers from catastrophic backtracking, which leads to O(2n) (ie exponential) solution time.

Although following the link will provide a far more thorough explanation, briefly the problem is that when the input doesn't match, the engine backtracks the first * term to try different combinations of the quantitys of the terms, but because all groups more or less match the same thing, the number of combinations of ways to group grows exponentially with the length of the backtracking - which in the case of non- matching input is the entire input.

The solution is to rewrite the regex so it won't catastrophically backtrack:

  • don't use groups of groups
  • use possessive quantifiers eg .*+ (which never backtrack)
  • fail early on non-match (eg using an anchored negative look ahead)
  • limit the number of times terms may appear using {n,m} style quantifiers

Or otherwise mitigate the problem

Upvotes: 3

Niels Billen
Niels Billen

Reputation: 2199

As @MarounMaroun already commented, you don't really have a pattern. It might be better to iterate over the string as in the following method:

public static boolean validate(String string) {
    char chars[] = string.toCharArray();

    if (!isSpecial(chars[0]) && !isLetterOrDigit(chars[0]))
        return false;
    if (!isSpecial(chars[chars.length - 1])
            && !isLetterOrDigit(chars[chars.length - 1]))
        return false;
    for (int i = 1; i < chars.length - 1; ++i)
        if (!isPunctiation(chars[i]) && !isLetterOrDigit(chars[i])
                && !isSpecial(chars[i]))
            return false;
    return true;
}

public static boolean isPunctiation(char c) {
    return c == '.' || c == ',';
}

public static boolean isSpecial(char c) {
    return c == '-' || c == '_' || c == '~';
}

public static boolean isLetterOrDigit(char c) {
    return (Character.isDigit(c) || (Character.isLetter(c) && (Character
            .getType(c) == Character.UPPERCASE_LETTER || Character
            .getType(c) == Character.LOWERCASE_LETTER)));
}

Test code:

public static void main(String[] args) {
    System.out.println(validate("aB78."));
    System.out.println(validate("aB78..ab "));
    System.out.println(validate("abcdef"));
    System.out.println(validate("aB78,1"));
    System.out.println(validate("aB78 abc"));
}

Output:

false
false
true
true
false

Upvotes: 2

Related Questions