blend

Reputation: 132

Why are java.util.regex.Matcher start() and end() returning extra characters in this case?

I'm generating some regexes dynamically and, in my application, replacing the matched results with another string after the fact. I take the start and end indices of each match, replace the matched chunk of characters, and then adjust the offsets for the following matches. However, in one case out of several that matched and were replaced successfully, I noticed that my start and end indices include extra characters.

Here is the code I'm using to generate the regexes:

Pattern.compile("[^a-zA-Z]+(?<match>" + Pattern.quote(search[i]) + ")[^a-zA-Z]+")

In the case that adds the extra characters:

search[i] = "on a daily basis"

The resulting regex

[^a-zA-Z]+(?<match>\Qon a daily basis\E)[^a-zA-Z]+

This is the relevant text that's being matched against

to on a daily basis.

My desired output is

on a daily basis

This is what matcher.group("match") returns. However, when I inspect the start() and end() results from the same matcher, I get 356 and 375 respectively (these are indices into the full text). The difference between those two numbers is 19, while the string "on a daily basis" is only 16 characters long.

I'm assuming that I need to account for the \Q and \E from Pattern.quote? But then where is the third extra character coming from? And why does this occur only for this particular pattern and target string?

Is there some other unrelated cause of the discrepancy that I'm overlooking?

Upvotes: 2

Views: 2144

Answers (1)

Rohit Jain

Reputation: 213233

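The result is as expected. You didn't account for the [^a-zA-Z]+ at the start and end of your pattern. Although the text captured by the group is 16 characters long, the complete match also includes whatever those two character classes consumed. In your text that is presumably the space before "on" plus the period and one more non-letter character after "basis", which gives the 3 extra characters (19 - 16). The \Q and \E from Pattern.quote are pattern syntax only; they never appear in the matched text and do not affect the indices.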

Matcher#group(String) returns the text matched by that named group, but Matcher#start() returns the start index of the complete match. The same applies to end(): it returns the index one past the last character of the complete match.

If you want the start and end indices of the matched group, pass the group name to start(String) and end(String).

Try this out on a small string and you'll see the difference.

// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String search = "on a daily basis";
String toMatch = "to on a daily basis. ";
Pattern pattern = Pattern.compile("[^a-zA-Z]+(?<match>" + Pattern.quote(search) + ")[^a-zA-Z]+");

Matcher matcher = pattern.matcher(toMatch);

if (matcher.find()) {
  // Complete match: " on a daily basis. " (includes the surrounding non-letters)
  System.out.println(matcher.group().length());        // 19
  System.out.println(matcher.start());                 // 2
  System.out.println(matcher.end());                   // 21

  // Named group only: "on a daily basis"
  System.out.println(matcher.group("match").length()); // 16
  System.out.println(matcher.start("match"));          // 3, your expected result
  System.out.println(matcher.end("match"));            // 19
}

So in the above example, the length of the group differs from the length of the complete match, and the complete match is what start() and end() without arguments describe.
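
For the replacement step described in the question, the group indices can be used directly. The following is only a minimal sketch, not your actual code: the class name GroupReplaceSketch, the helper replaceGroup, and the replacement string "daily" are made up for illustration, and it assumes you replace into a StringBuilder while tracking how far earlier replacements have shifted the indices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupReplaceSketch {

    // Hypothetical helper: replaces only the text captured by the "match" group,
    // leaving the surrounding non-letter characters untouched.
    static String replaceGroup(String text, String search, String replacement) {
        Pattern pattern = Pattern.compile(
                "[^a-zA-Z]+(?<match>" + Pattern.quote(search) + ")[^a-zA-Z]+");
        Matcher matcher = pattern.matcher(text); // indices refer to the original text

        StringBuilder result = new StringBuilder(text);
        int offset = 0; // shift caused by earlier replacements of a different length

        while (matcher.find()) {
            // Group indices, not matcher.start()/matcher.end() of the whole match
            int start = matcher.start("match") + offset;
            int end = matcher.end("match") + offset;
            result.replace(start, end, replacement);
            offset += replacement.length() - (end - start);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // prints "to daily. "
        System.out.println(replaceGroup("to on a daily basis. ", "on a daily basis", "daily"));
    }
}
```

Note that this pattern requires at least one non-letter character on both sides of the phrase, and the trailing [^a-zA-Z]+ consumes those characters, so an occurrence at the very start or end of the text, or one separated from the previous match only by characters that match already consumed, will not be found.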

Upvotes: 6
