namalfernandolk

Reputation: 9134

Why does the Java regex pattern a? always produce a zero-length match at the end of the source string?

Pattern pattern = Pattern.compile("a?");
Matcher matcher = pattern.matcher("a");
while(matcher.find()){
   System.out.println(matcher.start()+"["+matcher.group()+"]"+matcher.end());
}

Output :

0[a]1
1[]1

Why does this give me two matches when the source string contains only a single character?

I noticed that this pattern always gives a zero-length match at the end of the source string. E.g., when the source is "abab" it gives:

0[a]1
1[]1
2[a]3
3[]3
4[]4

Upvotes: 1

Views: 267

Answers (4)

Kaz

Reputation: 58578

Matching the empty space after the last character is not universal.

The Vim editor has this behavior:

Buffer before:

aaaa
~
~
:s/x\?/y/g  <- command

Buffer after:

yayayaya
~
~

No x occurs in aaaa but the x? (written x\? in Vim by default) allows an empty match. The pattern matches the empty space at the start of the string and between all the characters, but not past the end.

The exception is if the line is empty. The command will replace a blank line with a single y.

I implemented the Vim-like behavior in my own program:

$ txr -c '@(bind result @(regsub #/x?/ "y" "aaaa"))'
result="yayayaya"

$ txr -c '@(bind result @(regsub #/x?/ "y" ""))'
result="y"

I did this only because Vim is popular, so I can point to it as the reference model if any questions come up. But it's a bit of a hack. The logic has a do .. while loop, which allows an incoming empty string to be processed:

do {
  /* regex match, extraction, substitution ... */
  position++;
} while (position < length(input));

So if the starting position is zero and the input has length zero, we do the loop once, applying the regex to the empty string. But if we process the last character, position reaches the length and the loop terminates without processing the empty string.

Originally, I had the loop test at the top. That behaved like Vim for non-empty input, but with empty input the loop body never ran, so regexes that can match the empty string never got a chance to match.
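
The difference between the two loop shapes described above can be sketched in Java. This is only an illustration: the class and method names are made up, the regex work is replaced by recording which positions would be tried, and the position advances by one each time, as in the pseudocode.

```java
import java.util.ArrayList;
import java.util.List;

public class LoopShapes {
    // Bottom-tested loop: always tries position 0, never tries position == length
    // (except when the input is empty and 0 == length).
    static List<Integer> bottomTested(int length) {
        List<Integer> tried = new ArrayList<>();
        int position = 0;
        do {
            tried.add(position); // stand-in for "apply the regex at this position"
            position++;
        } while (position < length);
        return tried;
    }

    // Top-tested loop: the body never runs for empty input.
    static List<Integer> topTested(int length) {
        List<Integer> tried = new ArrayList<>();
        int position = 0;
        while (position < length) {
            tried.add(position);
            position++;
        }
        return tried;
    }

    public static void main(String[] args) {
        System.out.println(bottomTested(4)); // [0, 1, 2, 3]
        System.out.println(bottomTested(0)); // [0]
        System.out.println(topTested(0));    // []
    }
}
```

With a four-character input both loops stop before the end of the string; only the bottom-tested loop gives an empty input one attempt.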

The behavior of the Java class you're using might be implemented like this:

while (position <= length(input)) {
  /* process regex */
  position++;
}
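
For comparison with the Vim and txr results above, here is a small Java check of the same substitution. It uses only String.replaceAll; the class and method names are just for illustration. Java also matches the empty string after the last character, so it produces one more y than Vim does.

```java
public class EmptyMatchReplace {
    // Replaces every match of x? (including empty matches) with "y".
    static String replaceOptionalX(String input) {
        return input.replaceAll("x?", "y");
    }

    public static void main(String[] args) {
        System.out.println(replaceOptionalX("aaaa")); // yayayayay
        System.out.println(replaceOptionalX(""));     // y
    }
}
```

The trailing y in the first result is the empty match at position 4, the same one the question observed at the end of "abab".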

Upvotes: 0

maerics

Reputation: 156434

The regex special character ? (question mark) means "match the preceding thing zero or one time".

Since you are matching in a while loop (while (matcher.find()) {...) it finds both matches of the expression - one occurrence of "a" (at position 0, the string "a") and zero occurrences of "a" (at position 1, the empty string at the very end).

So here's what your code snippet is matching (start/end indices are denoted by X/Y):

String: " a b a b "
         ├─┼─┼─┼─┤
Index:   0 1 2 3 4
Match:   ╰┬╯ ╰┬╯ ╰- the empty string 4/4 (zero occurrences of "a").
          ||  |╰- the empty string 3/3 (zero occurrences of "a").
          ||  ╰ the string "a" 2/3 (one occurrence of "a").
          |╰ the empty string 1/1 (zero occurrences of "a").
          ╰ the string "a" 0/1 (one occurrence of "a").

It doesn't match at 0/0 or 2/2 because the quantifier is greedy: at those positions an "a" is available, so the matcher consumes it (giving 0/1 and 2/3) rather than settling for the empty match. To illustrate, if you matched the string "bbbb" against the pattern a?, you would get five empty matches: one at the beginning, one at the end, and one between each pair of characters.
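
The "bbbb" case can be checked with a small helper that collects every match in the same start[group]end format the question uses. The class and method names here are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListMatches {
    // Collects each match as start[group]end, like the question's println.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        // No "a" anywhere, so every position yields an empty match.
        System.out.println(matches("a?", "bbbb")); // [0[]0, 1[]1, 2[]2, 3[]3, 4[]4]
        // Same pattern on the question's input.
        System.out.println(matches("a?", "abab")); // [0[a]1, 1[]1, 2[a]3, 3[]3, 4[]4]
    }
}
```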

Upvotes: 3

Attila

Reputation: 28762

a? stands for 0-or-1 occurrences of the character a.

The empty string matches the 0-occurrence case.

The matching is also greedy in your case, so it matches the one occurrence first, then the zero occurrences at the end.

In the abab case, think of it as a[]ba[]b[], where [] denotes an empty match. The matcher does not report an empty match at the beginning or right after the first b, because at those positions it can greedily match an a.
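
One way to see the effect of greediness is to compare a? with its reluctant form a??, which prefers zero occurrences. This is a sketch for illustration (the class and method names are made up); the reluctant quantifier is not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns the start/end of every match as "start-end".
    static List<String> spans(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.start() + "-" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        // Greedy: takes the "a" wherever one is available.
        System.out.println(spans("a?", "abab"));  // [0-1, 1-1, 2-3, 3-3, 4-4]
        // Reluctant: settles for the empty match at every position.
        System.out.println(spans("a??", "abab")); // [0-0, 1-1, 2-2, 3-3, 4-4]
    }
}
```

Both variants end with the empty match at 4-4; greediness only changes what happens at positions where an a is present.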

Upvotes: 1

xpapad

Reputation: 4456

Have a look at

http://docs.oracle.com/javase/tutorial/essential/regex/quant.html

It explains your case in detail in the section "Zero-Length Matches".

Upvotes: 1
