user1635689
user1635689

Reputation: 79

Regular expression hangs - Java matcher

String:

Aqua, Sodium Laureth Sulfate, Sodium Lauryl Sulfate, Dimethicone, Cocamide MEA, Zinc Carbonate, Glycol Distearate, Sodium Chloride, Zinc Pyrithione, Sodium Xylenesulfonate, Cetyl Alcohol, Parfum, Guar Hydroxypropyltrimonium Chloride, Magnesium Sulfate, Sodium Benzoate, Ammonium Laureth Sulfate, Magnesium Carbonate Hydroxide, Linalool, Butylphenyl Methylpropional, Limonene, Hydroxyisohexyl 3-Cyclohexene Carboxaldehyde, Benzyl Alcohol, Hexyl Cinnamal, Citronellol, Tocopheryl Acetate, Paraffinum Liquidum, Sodium Polynaphthalenesulfonate, CI 19140, DMDM Hydantoin, CI 15510, Methylchloroisothiazolinone, Disodium EDTA, Tetrasodium EDTA, Methylisothiazolinone.

Current Regex:

System.out.println(string.matches("([\\W]*\\b[A-Z\\d]\\w+\\b[\\W]*)+"));

Java application hangs up. I can't find the error in the RegEx. By googeling I found out that this could be called "catastrophic backtracking"!? The Regex should match the String if it only contains uppercase words if for example 1 word is lower case in should not match it.

Upvotes: 2

Views: 1540

Answers (3)

Moz
Moz

Reputation: 119

Follow these tips which i have been using to avoid Regex hangs.

  1. Avoid using Regex in your programs if possible. Instead you should manually create a method or function that does all those things for you which is way more faster than using Regex which internally performs a heavy complile operation before giving you those results as expected.

  2. Avoid performing heavy Regex operations inside nested running loops. Instead you should create one loop that does everything by breaking it down into stages.

For example:

int count = 0;
int stage = 1;
while (true) {
    if (stage == 1) {
        // Perform first loop operations...
        // then move to stage 2...
        stage = 2;
        //continue;
    }
    if (stage == 2) {
        // Perform second loop operations...
        // Where you can call your Regex functions if you like...
        count++;
        if (count > 2) {
            stage = 3;
            //continue;
        }
    }
    if (stage == 3) {
        // Right here you may decide to terminate the loop completely...
        break;
        // NOTE: Failure to call break; when done with the loop will cause this loop to run as an infinite loop!
    }
}

Upvotes: 0

Peter Lawrey
Peter Lawrey

Reputation: 533500

Perhaps what you had in mind was

String regex = "([A-Z][\\d\\w]+( [A-Z][-\\d\\w]+)*, )*[A-Z][-\\d\\w]+( [A-Z][-\\d\\w]+)*\\.";
System.out.println(string.matches(regex));

returns true.

The problem you have with the regex is that its overly complicated. The disadvantage with adding expressions until you get true is that it can match things you didn't have in mind.

Random rand = new Random();
while(true) {
    byte[] bytes = new byte[40];
    rand.nextBytes(bytes);
    for (int i = 0; i < bytes.length; i++) bytes[i] &= 0x7F;
    String string = new String(bytes, 0);
    if (string.matches("([\\W]*\\b[A-Z\\d]\\w+\\b[\\W\\d]*)+"))
        System.out.println(string);
}

prints things such as

"^;%XX`'SwJ|[*4"*0C<Tgbom_. \^
{PvU_y9aJSm?08EL(   NpfA9a[:$YbN8VTtMk
;![`LR7Yy\AO5PZ@X4}GajC<*XvKE11
8l5W6*IDNH[9C'@.>7`LHsCN*,{26O}
EFJ5MBVxi%W_t6v54EmLmgjFvqyYh\<"
+7]|ULh2[MT`Yx{MKH4N
'8p!2mf

whereas the expression I gave matches

KfhBuGv7, S3.
IWzu, XHop4Z.
LJbXfrd, PdR.
V2dxQV, LA9z.
HKf37cy0, TS.
RAw2E5a, Ajs.
Up-, GPQ7 I_.

Upvotes: 0

jolivier
jolivier

Reputation: 7635

I recommend you split your input string by word and then pattern match it, event simpler: not to pattern match if you just want to test that the first letter of each word is uppercase, like:

for (String s : string.split("\\W")) {
  if (s.charAt(0) < 'A' || s.charAt(0) > 'Z') {
    return false;
  }
}

Sounds a lot faster to me (and you can even have the word that failed if you need).

Upvotes: 1

Related Questions