Daniel

Reputation: 256

Java regex matcher.find fails occasionally

I have a regex that extracts the names of all FreeMarker macros used in a template (for example, from <@macroName /> I need only macroName). The templates are usually quite large (around 30 thousand characters). The Java code with the regex looks like this:

Pattern pattern = Pattern.compile(".*?<@(.*?)[ /].*?", 
                                  Pattern.DOTALL | Pattern.UNIX_LINES);
Matcher matcher = pattern.matcher(inputText);
while(matcher.find()){
    //... some code
}

But sometimes I get this exception:

java.util.regex.Pattern$Curly.match1(Pattern.java:3814)
java.util.regex.Pattern$Curly.match(Pattern.java:3763)
java.util.regex.Pattern$Start.match(Pattern.java:3072)
java.util.regex.Matcher.search(Matcher.java:1116)
java.util.regex.Matcher.find(Matcher.java:552)
...

Does anybody know why this happens, or can anybody tell me whether the regex I'm using is well optimized? Thank you.

Upvotes: 1

Views: 1057

Answers (2)

Alan Moore

Reputation: 75222

You can get rid of the leading .*? because you don't need to consume the text before/between the matches. The regex engine will take care of scanning for the next match, and it will do it a lot more efficiently than what you're doing. Just give it the pattern for the tag itself and get out of its way.

You can get rid of the trailing .*? because it never does anything. Think about it: it's trying to match zero or more of any character, reluctantly. That means the first thing it tries to do is match nothing. That attempt will succeed (it's always possible to match nothing), so it never tries to consume more characters.
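To see that the trailing .*? contributes nothing, here is a minimal sketch that compares where the first match ends with and without it. The class name, the helper firstMatchEnd, and the sample input are made up for illustration; only the two patterns come from the discussion above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingReluctantDemo {
    // Hypothetical helper: returns the end offset of the first match, or -1 if none.
    static int firstMatchEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "<@menu /> trailing text";
        // Both patterns stop right after "<@menu " (offset 7): the reluctant
        // quantifier at the end of the pattern is satisfied by the empty string.
        System.out.println(firstMatchEnd("<@(.*?)[ /].*?", input)); // 7
        System.out.println(firstMatchEnd("<@(.*?)[ /]", input));    // 7
    }
}
```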

You probably want something like this:

<@(\w+)[\s/]

...or in Java-speak:

Pattern p = Pattern.compile("<@(\\w+)[\\s/]");

You don't need DOTALL (no dots) or any other modifiers.
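Put together, the extraction loop could look roughly like the sketch below. The class name MacroNames, the method extract, and the sample template are assumptions for illustration, not part of the original code. Note that this pattern only recognizes names made of word characters that are followed by whitespace or a slash, so other tag shapes would need a wider pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MacroNames {
    // The simplified pattern: "<@", a captured name, then whitespace or '/'.
    private static final Pattern MACRO_CALL = Pattern.compile("<@(\\w+)[\\s/]");

    // Hypothetical helper: collects the captured name of every macro call found.
    static List<String> extract(String template) {
        List<String> names = new ArrayList<>();
        Matcher m = MACRO_CALL.matcher(template);
        while (m.find()) {
            names.add(m.group(1)); // group 1 is the macro name
        }
        return names;
    }

    public static void main(String[] args) {
        String template = "<p><@header title=\"x\" /></p>\n<@footer/>";
        System.out.println(extract(template)); // [header, footer]
    }
}
```

Because the pattern starts with the literal <@, find() can skip ahead between tags on its own instead of consuming the intervening text one character at a time.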

Upvotes: 1

dda

Reputation: 6203

For <@macro macroName /> your regex looks a little bit convoluted. Either there are things (special cases) that <@macro macroName /> doesn't describe, or the regex is trying too hard. Try:

<@macro\s+(\S+)\s+/>

You should now have the macro's name in group #1.
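A minimal sketch of using that pattern from Java, assuming the tags really have the literal form <@macro name /> (the class name, the helper nameOf, and the inputs are hypothetical). Since the pattern requires whitespace before />, a tag written without that space is not matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MacroTagDemo {
    // The pattern from this answer, with backslashes escaped for a Java string.
    private static final Pattern MACRO_TAG = Pattern.compile("<@macro\\s+(\\S+)\\s+/>");

    // Hypothetical helper: returns the name in group 1, or null if nothing matches.
    static String nameOf(String input) {
        Matcher m = MACRO_TAG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(nameOf("<@macro myName />")); // myName
        System.out.println(nameOf("<@macro myName/>"));  // null (no space before "/>")
    }
}
```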

Upvotes: 3
