Reputation: 256
I have a regexp that parses the names of all FreeMarker macros used in a template (for example, from <@macroName /> I need only macroName). Templates are usually quite large (around 30 thousand characters).
The Java code with the regex looks like this:
Pattern pattern = Pattern.compile(".*?<@(.*?)[ /].*?",
Pattern.DOTALL | Pattern.UNIX_LINES);
Matcher matcher = pattern.matcher(inputText);
while(matcher.find()){
//... some code
}
But sometimes I get this exception:
java.util.regex.Pattern$Curly.match1(Pattern.java:3814)
java.util.regex.Pattern$Curly.match(Pattern.java:3763)
java.util.regex.Pattern$Start.match(Pattern.java:3072)
java.util.regex.Matcher.search(Matcher.java:1116)
java.util.regex.Matcher.find(Matcher.java:552)
...
Does anybody know why this happens, or can anybody tell me whether the regexp I'm using is well optimized? Thank you.
Upvotes: 1
Views: 1057
Reputation: 75222
You can get rid of the leading .*? because you don't need to consume the text before/between the matches. The regex engine will take care of scanning for the next match, and it will do it a lot more efficiently than what you're doing. Just give it the pattern for the tag itself and get out of its way.
You can get rid of the trailing .*? because it never does anything. Think about it: it's trying to match zero or more of any characters, reluctantly. That means the first thing it tries to do is match nothing. That attempt will succeed (it's always possible to match nothing), so it never tries to consume more characters.
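To see that the trailing .*? contributes nothing, here is a small sketch that runs the tag part of the original pattern with and without it. The class name, the helper firstMatch and the sample input are made up for illustration; only the regex fragments come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingReluctant {
    // Hypothetical helper: returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<@macroName /> rest of the template";
        String without = firstMatch("<@(.*?)[ /]", input);
        String with = firstMatch("<@(.*?)[ /].*?", input);
        // The reluctant quantifier at the end is satisfied by zero characters,
        // so both patterns stop right after the first space or slash.
        System.out.println(without.equals(with)); // prints "true"
    }
}
```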
You probably want something like this:
<@(\w+)[\s/]
...or in Java-speak:
Pattern p = Pattern.compile("<@(\\w+)[\\s/]");
You don't need DOTALL (no dots) or any other modifiers.
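Put together, the extraction loop could look roughly like the sketch below. The class name MacroNames, the method extract and the sample template are assumptions for illustration. The regex is the <@(\w+)[\s/] pattern from above, so it only picks up names made of word characters that are followed by whitespace or a slash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MacroNames {
    // Compiled once and reused; no flags are needed because the pattern contains no dots.
    private static final Pattern MACRO = Pattern.compile("<@(\\w+)[\\s/]");

    // Hypothetical helper: collects the macro name (group 1) of every match, in order.
    static List<String> extract(String template) {
        List<String> names = new ArrayList<>();
        Matcher m = MACRO.matcher(template);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String template = "<html>\n<@header />\n<@menu items=list/>\n<@footer/>\n</html>";
        System.out.println(extract(template)); // prints [header, menu, footer]
    }
}
```

Because find() resumes scanning after the previous match on its own, nothing in the pattern has to consume the text between tags.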
Upvotes: 1
Reputation: 6203
For <@macro macroName /> your regex looks a little bit convoluted. Either there are things (special cases) that <@macro macroName /> doesn't describe, or the regex is trying too hard. Try:
<@macro\s+(\S+)\s+/>
You should now have the macro's name in group #1.
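In Java, that pattern might be used as in the sketch below. The class and method names are invented for illustration, and it assumes the tags really have the literal form <@macro macroName /> with whitespace before the closing />.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MacroKeyword {
    // Backslashes are doubled because the regex lives inside a Java string literal.
    private static final Pattern MACRO = Pattern.compile("<@macro\\s+(\\S+)\\s+/>");

    // Hypothetical helper: returns the name from the first matching tag, or null if none matches.
    static String firstName(String template) {
        Matcher m = MACRO.matcher(template);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstName("text <@macro header /> more text")); // prints header
    }
}
```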
Upvotes: 3