Reputation: 9134
Pattern pattern = Pattern.compile("a?");
Matcher matcher = pattern.matcher("a");
while (matcher.find()) {
    System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
}
Output:
0[a]1
1[]1
Why does this give me two matches when the input is a single character?
I noticed that this pattern always gives a zero-length match at the end of the source string. E.g., when the source is "abab" it gives:
0[a]1
1[]1
2[a]3
3[]3
4[]4
Upvotes: 1
Views: 267
Reputation: 58578
Matching the empty space after the last character is not universal.
The Vim editor has this behavior:
Buffer before:
aaaa
~
~
:s/x\?/y/g <- command
Buffer after:
yayayaya
~
~
No x occurs in aaaa, but the x? (written x\? in Vim by default) allows an empty match. The pattern matches the empty space at the start of the string and between all the characters, but not past the end.
The exception is if the line is empty. The command will replace a blank line with a single y.
I implemented the Vim-like behavior in my own program:
$ txr -c '@(bind result @(regsub #/x?/ "y" "aaaa"))'
result="yayayaya"
$ txr -c '@(bind result @(regsub #/x?/ "y" ""))'
result="y"
I chose this only because Vim is popular and I can point to it as the reference model if any questions come up, but it's a bit of a hack. The logic has a do .. while loop, which allows an incoming empty string to be processed:
do {
/* regex match, extraction, substitution ... */
position++;
} while (position < length(input))
So if the starting position is zero and the input has length zero, we do the loop once, applying the regex to the empty string. But if we process the last character, position reaches the length and the loop terminates without processing the empty string.
Originally, I had the loop test at the top. That behaved like Vim for non-empty input, but for empty input the loop body never ran, so a regex that can match the empty string produced no match at all.
The behavior of the Java class you're using might be implemented like this:
while (position <= length(input)) {
/* process regex */
position++;
}
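For comparison, here is a minimal Java sketch of the same substitution done with String.replaceAll, which uses the java.util.regex engine. The class and method names are made up for illustration. Unlike Vim, Java also replaces the empty match past the last character, so one extra y appears at the end:

```java
class EmptyMatchReplaceDemo {
    // Hypothetical helper: replaces every match of x? (including empty matches) with "y".
    static String replaceOptionalX(String input) {
        return input.replaceAll("x?", "y");
    }

    public static void main(String[] args) {
        // Empty matches at positions 0, 1, 2, 3 and also at 4 (past the last character).
        System.out.println(replaceOptionalX("aaaa")); // prints yayayayay
        // An empty input still yields one empty match at position 0.
        System.out.println(replaceOptionalX(""));     // prints y
    }
}
```

This is consistent with the guess above that the loop condition behaves like position <= length(input), although the actual implementation inside the JDK is not shown here.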
Upvotes: 0
Reputation: 156434
The regex special character ? (question mark) means "match the preceding thing zero or one time".
Since you are matching in a while loop (while (matcher.find()) {...) it finds both matches of the expression: one occurrence of "a" (at position 0, the string "a") and zero occurrences of "a" (at position 1, the empty string at the very end).
So here's what your code snippet is matching (start/end indices are denoted by X/Y):
String: "abab" (boundary indices 0 to 4, where 0 is before the first character and 4 is after the last)
0/1: the string "a" (one occurrence of "a").
1/1: the empty string (zero occurrences of "a").
2/3: the string "a" (one occurrence of "a").
3/3: the empty string (zero occurrences of "a").
4/4: the empty string (zero occurrences of "a").
It doesn't match at positions 0/0 or 2/2 because the expression is greedy: it tries to consume the next character (at positions 0/1 and 2/3) as long as doing so doesn't invalidate the match. Here it doesn't, so the empty matches at those positions are skipped. To illustrate, if you were to match the string "bbbb" against the pattern a? then you would get five empty strings: one at the beginning, one at the end, and one between each pair of adjacent characters.
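The two cases described above can be checked with a short Java sketch. The class name and the findAll helper are assumptions made for this example; the output format mirrors the one used in the question (start, matched text in brackets, end):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthMatchDemo {
    // Hypothetical helper: collects every match as "start[text]end".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        // Two one-character matches plus three empty matches.
        System.out.println(findAll("a?", "abab")); // [0[a]1, 1[]1, 2[a]3, 3[]3, 4[]4]
        // No "a" anywhere, so only the five empty matches remain.
        System.out.println(findAll("a?", "bbbb")); // [0[]0, 1[]1, 2[]2, 3[]3, 4[]4]
    }
}
```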
Upvotes: 3
Reputation: 28762
a? stands for 0-or-1 occurrences of the character a.
The empty string matches the 0-occurrence case.
The matching is also greedy in your case, so it matches the 1 occurrence first, then the 0 occurrence at the end.
In the abab case, think of it as a[]ba[]b[], where [] denotes the empty occurrence found. The matcher does not find it at the beginning or before the second a, because it can greedily match on a.
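One way to see the effect of greediness is to compare a? with its reluctant counterpart a??, which prefers zero occurrences. This goes slightly beyond the question and is only a sketch; the class name and the findAll helper are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Hypothetical helper: collects every match as "start[text]end".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        // Greedy: takes the "a" when it can, so there is no empty match at 0 or 2.
        System.out.println(findAll("a?", "abab"));  // [0[a]1, 1[]1, 2[a]3, 3[]3, 4[]4]
        // Reluctant: settles for the empty match at every position, never consuming "a".
        System.out.println(findAll("a??", "abab")); // [0[]0, 1[]1, 2[]2, 3[]3, 4[]4]
    }
}
```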
Upvotes: 1
Reputation: 4456
Have a look at
http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
It explains your case in detail under the section "Zero-Length Matches".
Upvotes: 1