namalfernandolk

Reputation: 9134

Zero-length matches in Java Regex

My code :

    Pattern pattern = Pattern.compile("a?");
    Matcher matcher = pattern.matcher("ababa");
    while (matcher.find()) {
        System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
    }

Output :

0[a]1
1[]1
2[a]3
3[]3
4[a]5
5[]5

What I know :

Java API says :

What I want to know:

  1. In which situations does the regex engine encounter zero occurrences of the given character(s)? Here, that is the character 'a'.
  2. In those situations, what values do the matcher's start(), end() and group() methods actually return? I have mentioned what the Java API says, but I'm a little unclear about what happens in a practical case like the one above.

Upvotes: 8

Views: 3863

Answers (2)

Rohit Bansal

Reputation: 1199

Walking through a few examples should make the behaviour of matcher.find() clear:

The regex engine starts at a position in the string (here ababa) and tries to match the pattern at that position. If the pattern matches one or more characters, then (as the API says):

matcher.start() returns the index of the first character matched, and matcher.end() returns the offset after the last character matched.

If the pattern can only match the empty string at that position, the match still succeeds, but start() and end() return the same index, which signals that the matched length is zero.

Look at the following examples:

    // Matches either "a" or the empty string ""
    Pattern pattern = Pattern.compile("a?");
    Matcher matcher = pattern.matcher("abaabbbb");
    while (matcher.find()) {
        System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
    }

Output:

    0[a]1
    1[]1
    2[a]3
    3[a]4
    4[]4
    5[]5
    6[]6
    7[]7
    8[]8


    // Matches either "aa" or "a" (never the empty string)
    Pattern pattern = Pattern.compile("aa?");
    Matcher matcher = pattern.matcher("abaabbbb");
    while (matcher.find()) {
        System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
    }

Output:

    0[a]1
    2[aa]4
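If you want to tell the empty matches apart from the real ones in code, comparing start() and end() is probably the simplest check. Below is a minimal sketch of that idea; the class name ZeroLengthDemo and the helper describeMatches are made up for illustration, and only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthDemo {

    // Hypothetical helper (not part of the JDK): lists every match
    // and flags the ones that matched the empty string.
    static List<String> describeMatches(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            // For a zero-length match, start() == end() and group() is "".
            boolean empty = matcher.start() == matcher.end();
            lines.add(matcher.start() + "[" + matcher.group() + "]" + matcher.end()
                    + (empty ? " (zero-length)" : ""));
        }
        return lines;
    }

    public static void main(String[] args) {
        describeMatches("a?", "ababa").forEach(System.out::println);
    }
}
```

For the string from the question, this flags the matches at offsets 1, 3 and 5 as zero-length. The one at 5 sits at the very end of the input, after the last character.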

Upvotes: 3

Guillaume Polet

Reputation: 47608

The ? is a greedy quantifier, so it first tries to match one occurrence before falling back to zero occurrences. In your string,

  1. it starts with the first char 'a' and tries to match one occurrence. The 'a' char matches, so it returns the first result you see.
  2. then it moves forward and finds a 'b'. The 'b' char does not match the one-occurrence case, so the engine backtracks and attempts to match zero occurrences. The result is that the empty string is matched, which gives you your second result.
  3. then it moves past the 'b', since no more matches are possible there, and starts again with your second 'a' char.
  4. and so on; you get the point.

It is a bit more complicated than that, but that is the main idea: when one occurrence cannot match, the engine then tries zero occurrences.
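To see that the order of attempts matters, you can compare the greedy a? with its reluctant counterpart a??, which tries zero occurrences first. This is only a small sketch; the class name GreedyVsReluctant and the helper matches are hypothetical names I picked for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    // Hypothetical helper: collects "start[group]end" for each match found.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "[" + m.group() + "]" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Greedy: prefers one 'a', falls back to the empty string.
        System.out.println(matches("a?", "ababa"));
        // Reluctant: prefers the empty string, so no 'a' is ever consumed.
        System.out.println(matches("a??", "ababa"));
    }
}
```

The greedy pattern prints [0[a]1, 1[]1, 2[a]3, 3[]3, 4[a]5, 5[]5], while the reluctant one prints [0[]0, 1[]1, 2[]2, 3[]3, 4[]4, 5[]5], because the empty match succeeds at every position before 'a' is even tried.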

As for the values of start(), end() and group(): they are where the match starts, where it ends, and what was matched. So for the first zero-occurrence match in your string, you get 1, 1 and the empty string. I am not sure this really answers your question.

Upvotes: 11
