Reputation: 479
I'm trying to capture nested optional groups in Java but it's not working out.
I'm trying to capture a keyword followed by an interval, where a keyword is anything for now, and an interval is just two dates. The interval may be optional, and the two dates may be optional as well. So, the following are valid matches.
I want to capture the keyword and both the dates even if they are null.
This is the Java MWE I've came up with.
public class Parser {
public static void main(String[] args) {
Parser parser = new Parser();
String s = "word [01/01/1900, 01/01/2000]";
parser.parse(s);
}
public void parse(String s) {
String date = "\\d{2}/\\d{2}/\\d{4}";
String interval = "\\[("+date+")?, ("+date+")?\\]";
String keyword = "(.+)( "+interval+")?";
Pattern p = Pattern.compile(keyword);
Matcher m = p.matcher(s);
if (m.matches()) {
for (int i = 0; i <= m.groupCount(); ++i) {
System.out.println(i + ": " + m.group(i));
}
}
}
}
And this is the output
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
4: null
If interval isn't optional, then it works.
String keyword = "(.+)( "+interval+")";
0: word [01/01/1900, 01/01/2000]
1: word
2: [01/01/1900, 01/01/2000]
3: 01/01/1900
4: 01/01/2000
If interval is a non-matching group (but still optional), then it doesn't work.
String keyword = "(.+)(?: "+interval+")?";
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
What do I need to do to retrieve back both dates? Thank You.
Edit: Part 2.
Suppose now I watch to match repeated keywords. i.e. the regex, keyword(, keyword)*
. I tried this out, but only the first and the last instance is captured.
For simplicity, suppose I want to match the following a, b, c, d
with the regex ([a-z])(?:, ([a-z]))*
However, I can only retrieve back the first and last group.
0: a, b, c, d
1: a
2: d
Why is this so?
Just found out that this cannot be done. Capture group multiple times
Upvotes: 2
Views: 1211
Reputation: 31699
Change the first part of keyword
from (.+)
to (.+?)
.
Without the ?
, the (.+)
is a greedy quantifier. That means it will try to match as much as it can. I don't know all the mechanics of how the regex engine works, but I believe that in your case, what it's doing is setting some counter N
to the number of characters remaining in the source. If it can use up that many characters and get the whole regex to match, it will. Otherwise, it tries N-1
, N-2
, etc., until the entire regex matches. I also think it goes from left to right when trying this; that is, since (.+)
is the leftmost "part" of the pattern (for some definition of "part"), it loops on that part before it tries any looping on parts that are to the right. Thus, it's more important to make (.+)
greedy than to make any other part of the pattern greedy; the (.+)
takes precedence.
In your case, since (.+)
is followed by an optional part, the regex matcher starts by trying the entire remainder of the string--and it succeeds, because the rest of the string, which is empty, is a fine match for an optional substring. That should also explains why it doesn't work if your substring isn't optional--the empty substring no longer matches.
Adding ?
makes it a "reluctant" (or "stingy") quantifier, which works in the opposite direction. It starts by seeing if it can make a match with 0 characters, then 1, 2, ..., instead of starting with N and going downward. So when it gets up to 5, matching "word "
, and it finds that the rest of the string matches your optional part, it completes and gives the results you were expecting.
Upvotes: 1