Reputation: 129507
I'm trying to use a regular expression to parse a file by extracting certain pieces of text. The regular expressions I need are not supported by the standard java.util.regex package (I need to match nested constructs, such as nested {} brackets and other similar things), so I decided to try JRegex, which claims to fully handle Perl 5.6 regex syntax. However, I ran into a problem when trying to use this package with a recursive regex to match the nested {} brackets:
Pattern p = new Pattern("(\\{(?:(?1)*|[^{}]*)+\\}|\\w+)"); // jregex.Pattern
Exception in thread "main" jregex.PatternSyntaxException: wrong char after "(?": 1
The analogous regex /(\{(?:(?1)*|[^{}]+)+\}|\w+)/sg works as expected in Perl, however. So my next idea was to find a way to parse the file in Perl and then pass the results to Java (preferably in the form of a string array or something similar), and my question is: what is the best way to do that in this case? Or is there another, simpler alternative that I am overlooking?
Upvotes: 0
Views: 3293
Reputation: 13631
JRegex does not seem to support recursive matching (recursive subpatterns such as (?1) were only added in Perl 5.10, so Perl 5.6 compatibility would not cover them), so I suggest you just use java.util.regex and set a limit on the number of levels of nesting.
For example, to allow up to fifty levels of nesting, with an 'unlimited' number of bracket pairs on each level (except the deepest), you could use:
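import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Set the maximum number of nested levels required.
int max = 50;
// "(?R)" is only a textual placeholder; it is replaced before the regex is compiled.
String regex = "(?R)";
while (--max > 0) {
    regex = regex.replace("(?R)", "(?>\\{(?:[^{}]*+|(?R))+\\})");
}
// Ensure no (?R) in the final and deepest replacement.
regex = regex.replace("(?R)", "\\{[^{}]*+\\}") + "|\\w+";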
String str = " {{}{}} {abc} {{de}{fg}} hij {1{2{3{4{5{6{7{8{9{10{11{12{13{14{15{16{17{18{19{20{21{22{23{24{25{26{27{28{29{30{31{32{33{34{35{36{37{38{39{40{41{42{43{44{45{46{47{48{49{50}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}} {end}";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find()) {
    System.out.println(m.group());
}
/*
{{}{}}
{abc}
{{de}{fg}}
hij
{1{2{3{4{5{6{7{8{9{10{11{12{13{14{15{16{17{18{19{20{21{22{23{24{25{26{27{28{29{30{31{32{33{34{35{36{37{38{39{40{41{42{43{44{45{46{47{48{49{50}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}
{end}
*/
The above builds a regular expression by taking one that could be used if recursive matching were supported, (?>\\{(?:[^{}]*+|(?R))+\\}), and repeatedly replacing the (?R) placeholder with the whole pattern.
Because the resulting expression contains many nested quantifiers, atomic grouping (?>...) and the possessive quantifier *+ are used to limit backtracking and ensure that the regex fails fast if it cannot find a match. Although the regex may be long, it should be efficient.
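To make the substitution described above easier to inspect, here is a minimal sketch that wraps it in a helper. The class name NestedBraceRegex and the methods build and findAll are hypothetical names introduced for illustration; the pattern fragments are the ones from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of any library): expands the bounded-depth regex.
class NestedBraceRegex {
    // One nesting level. "(?R)" is a textual placeholder that is always
    // replaced before the pattern reaches java.util.regex.
    private static final String LEVEL = "(?>\\{(?:[^{}]*+|(?R))+\\})";
    // The deepest level may not contain any further brackets.
    private static final String DEEPEST = "\\{[^{}]*+\\}";

    // Returns a regex source that matches {...} groups nested up to maxDepth levels.
    static String build(int maxDepth) {
        if (maxDepth < 1) {
            throw new IllegalArgumentException("maxDepth must be at least 1");
        }
        String regex = "(?R)";
        for (int level = 1; level < maxDepth; level++) {
            regex = regex.replace("(?R)", LEVEL);
        }
        return regex.replace("(?R)", DEEPEST);
    }

    // Returns every bracket group or word found in text.
    static List<String> findAll(String text, int maxDepth) {
        Matcher m = Pattern.compile(build(maxDepth) + "|\\w+").matcher(text);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(build(2));
        System.out.println(findAll(" {{}{}} {abc} hij", 2));
    }
}
```

Printing build(2) or build(3) shows how each substitution adds one level, and a group nested deeper than maxDepth will not be matched as a whole.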
If you don't want to, or are unable to, set a limit on the nesting, or if the idea of a lengthy regex is worrying, you could parse the nested brackets by simply iterating over the file text and tracking the number of opening and closing brackets, for example:
List<String> list = new ArrayList<String>();
int strLen = str.length();
for (int i = 0; i < strLen; i++) {
    char c = str.charAt(i);
    if (c == '{') {
        // b counts the brackets that are currently open.
        int b = 1;
        StringBuilder sb = new StringBuilder("{");
        // Consume characters until the opening bracket is balanced
        // or the input ends.
        while (b > 0 && i < strLen - 1) {
            sb.append( c = str.charAt(++i) );
            if (c == '}') b--;
            else if (c == '{') b++;
        }
        list.add(sb.toString());
    }
}
for (String s : list) { System.out.println(s); }
That seems like a lot less trouble than interacting with Perl, but see answers such as "How should I call a Perl Script in Java?" if that is what you want to do.
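The bracket-counting loop above can also be packaged as a reusable method. The following is only a sketch under a few assumptions: the class name BraceScanner and the method topLevelGroups are hypothetical, a stray closing bracket outside any group is ignored, and, unlike the loop above (which keeps a partial group when the input ends early), an unclosed group is reported as an error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects top-level {...} groups by counting brackets.
class BraceScanner {
    static List<String> topLevelGroups(String text) {
        List<String> groups = new ArrayList<>();
        int depth = 0;   // number of brackets currently open
        int start = -1;  // index of the '{' that opened the current top-level group
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {
                    groups.add(text.substring(start, i + 1));
                }
            }
        }
        if (depth != 0) {
            // Assumption: unbalanced input is treated as an error.
            throw new IllegalArgumentException("unclosed '{' at index " + start);
        }
        return groups;
    }

    public static void main(String[] args) {
        for (String group : topLevelGroups(" {{}{}} {abc} {{de}{fg}} hij {end}")) {
            System.out.println(group);
        }
    }
}
```

Because it only keeps a counter and a start index, this version has no nesting limit and does a single pass over the text.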
Upvotes: 3
Reputation: 9577
The best way is to tokenize the input, feed the resulting token stream to a parser, and parse it top-down or bottom-up depending on your needs. Regex is not always helpful for parsing nested structures.
The JLex utility is based upon the Lex lexical analyzer generator model. JLex takes a specification file similar to that accepted by Lex, then creates a Java source file for the corresponding lexical analyzer.
Have a look at JLex, as it may help you generate a lexical analyzer for your case from a very simple specification.
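As a rough illustration of the tokenize-then-parse idea, here is a small hand-written lexer and top-down (recursive descent) parser. It does not use JLex; the class BraceParser, the token kinds, and the choice to represent each group as a nested List are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: hand-written lexer plus recursive descent parser.
class BraceParser {
    enum Kind { LBRACE, RBRACE, WORD }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Lexer: turns the input into brace and word tokens, skipping everything else.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '{') { tokens.add(new Token(Kind.LBRACE, "{")); i++; }
            else if (c == '}') { tokens.add(new Token(Kind.RBRACE, "}")); i++; }
            else if (isWordChar(c)) {
                int start = i;
                while (i < input.length() && isWordChar(input.charAt(i))) i++;
                tokens.add(new Token(Kind.WORD, input.substring(start, i)));
            } else {
                i++;
            }
        }
        return tokens;
    }

    private final List<Token> tokens;
    private int pos = 0;

    BraceParser(List<Token> tokens) { this.tokens = tokens; }

    // Grammar: items := (WORD | '{' items '}')*
    // Words become Strings; each group becomes a nested List<Object>.
    List<Object> parseItems(boolean insideGroup) {
        List<Object> items = new ArrayList<>();
        while (pos < tokens.size()) {
            Token t = tokens.get(pos);
            if (t.kind == Kind.WORD) {
                items.add(t.text);
                pos++;
            } else if (t.kind == Kind.LBRACE) {
                pos++;
                items.add(parseItems(true));
                if (pos >= tokens.size()) throw new IllegalStateException("missing '}'");
                pos++; // consume the matching '}'
            } else {
                if (!insideGroup) throw new IllegalStateException("unexpected '}'");
                return items; // the caller consumes the '}'
            }
        }
        return items;
    }

    static List<Object> parse(String input) {
        return new BraceParser(tokenize(input)).parseItems(false);
    }

    public static void main(String[] args) {
        System.out.println(parse("{{de}{fg}} hij"));
    }
}
```

A generated lexer would replace the tokenize method; the parser side stays the same because it only sees the token list.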
Upvotes: 1
Reputation: 6089
Regex can't really handle nested delimiters. I've approached this in the past by using a regex to find the delimiters and then using a simple Finite State Machine to parse the resulting array.
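One possible reading of that two-step approach, sketched in Java: a regex pass collects the delimiter positions, and a second pass walks that list while tracking state. The class DelimiterMachine and its groups method are hypothetical, and the "state" here is simply a depth counter (strictly speaking a counter rather than a pure finite state machine). Unclosed groups are silently dropped in this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regex finds the delimiters, a small state loop pairs them.
class DelimiterMachine {
    private static final Pattern DELIMITER = Pattern.compile("[{}]");

    static List<String> groups(String text) {
        // Step 1: the regex pass records the position of every delimiter.
        List<Integer> positions = new ArrayList<>();
        Matcher m = DELIMITER.matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }

        // Step 2: walk the delimiter positions, tracking the nesting depth.
        List<String> groups = new ArrayList<>();
        int depth = 0;
        int start = -1;
        for (int p : positions) {
            if (text.charAt(p) == '{') {
                if (depth == 0) {
                    start = p;
                }
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) {
                    groups.add(text.substring(start, p + 1));
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(groups("a {b {c}} d {e}"));
    }
}
```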
Upvotes: 0