Woody
Woody

Reputation: 844

Very slow Regular Expression in Java

Using Java, i want to detect if a line starts with words and separator then "myword", but this regex takes too long. What is incorrect ?

^\s*(\w+(\s|/|&|-)*)*myword

Upvotes: 0

Views: 515

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626689

The pattern ^\s*(\w+(\s|/|&|-)*)*myword is not efficient due to the nested quantifier. \w+ requires at least one word character and (\s|/|&|-)* can match zero or more of some characters. When the * is applied to the group and the input string has no separators in between word characters, the expression becomes similar to a (\w+)* pattern that is a classical catastrophical backtracking issue pattern.

Just a small illustration of \w+ and (\w+)* performance:

\w+:                                                 (\w+)*

enter image description here enter image description here

You pattern is even more complicated and invloves more those backtracking steps. To avoid such issues, a pattern should not have optional subpatterns inside quantified groups. That is, create a group with obligatory subpatterns and apply the necessary quantifier to the group.

In this case, you can unroll the group you have as

String rx = "^\\s*(\\w+(?:[\\s/&-]+\\w+)*)[\\s/&-]+myword";

See IDEONE demo

Here, (\w+(\s|/|&|-)*)* is unrolled as (\w+(?:[\s/&-]+\w+)*) (I kept the outer parentheses to produce a capture group #1, you may remove these brackets if you are not interested in them). \w+ matches one or more word characters (so, it is an obligatory subpatter), and the (?:[\s/&-]+\w+)* subpattern matches zero or more (*, thus, this whole group is optional) sequences of one or more characters from the defined character class [\s/&-]+ (so, it is obligatory) followed with one or more word characters \w+.

Upvotes: 4

Related Questions