Reputation: 111
I am working with a regex that I want to use with the replaceAll method of the String class in Java.
My regex works fine and groupCount() returns 11. However, when I try to replace my text using a backreference to the eleventh group, I get the first group with a "1" appended to it instead of group eleven.
String regex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)([^<]*<)";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(regex, "$1<a href=\"tel:$2\">$2</a>$11");
I am expecting to get the following result:
<span style="font-size:11.0pt"><a href="tel:675-441-3144;;;78888464#">675-441-3144;;;78888464#</a><o:p></o:p></span>
But the $11 backreference does not return the 11th group; it returns the first group with a 1 appended to it, so I get the following result instead:
<span style="font-size:11.0pt"><a href="tel:675-441-3144">675-441-3144</a>>1o:p></o:p></span>
Can someone please tell me how to access the eleventh group of my pattern?
Thanks.
Upvotes: 0
Views: 1014
Reputation: 2188
The way you access the eleventh group of a match in the replacement is with $11.
As the corresponding Javadoc* states:
The replacement string may contain references to subsequences captured during the previous match: Each occurrence of ${name} or $g will be replaced by the result of evaluating the corresponding group(name) or group(g) respectively. For $g, the first number after the $ is always treated as part of the group reference. Subsequent numbers are incorporated into g if they would form a legal group reference.
So generally speaking, as long as you have at least eleven groups, "$11" will evaluate to group(11). However, if you do not have at least eleven groups, "$11" will evaluate to group(1) + "1".
* This quote is from Matcher#appendReplacement(StringBuffer,String), which is where the chain of relevant citations from String#replaceAll(String,String) leads.
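The rule quoted above can be checked with a small sketch. The class name BackrefDemo and the toy patterns are made up for illustration; only String#replaceAll and the ${name} / $g syntax come from the Javadoc.

```java
public class BackrefDemo {
    // Eleven capturing groups: "$11" is a legal reference to group 11.
    static String elevenGroups() {
        return "abcdefghijk".replaceAll("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)", "[$11]");
    }

    // Only two groups: "$11" is read as group 1 followed by a literal "1".
    static String twoGroups() {
        return "ab".replaceAll("(a)(b)", "[$11]");
    }

    // A named group avoids counting groups altogether.
    static String namedGroup() {
        return "ab".replaceAll("(a)(?<last>b)", "[${last}]");
    }

    public static void main(String[] args) {
        System.out.println(elevenGroups()); // [k]
        System.out.println(twoGroups());    // [a1]
        System.out.println(namedGroup());   // [b]
    }
}
```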
Your regex does not do what you think it does.
Let's divide your regex into its three top-level groups. These are groups 1, 2, and 11, respectively.
(>[^<]*?)
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
([^<]*<)
Group 2 is the main body of your regex, and it consists of a top-level alternation over two options. These two options consist of groups 3-8 and 9-10, respectively.
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
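Now, given the text string, here is what is going on:
Group 1 matches ">".
Group 2 matches "675-441-3144".
Within group 2, option 1 of the alternation is the branch that matches "675-441-3144".
Group 2 therefore ends after "675-441-3144", which is immediately before ";;;78888464#".
Group 11 matches everything up to and including the next "<", which is all of ";;;78888464#<".
Thus, some of the content that you want to be in group 2 is actually in group 11 instead.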
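One way to see this for yourself is to dump every group of the first match. The following is a minimal sketch; the class name GroupDump is hypothetical, while the regex and text are the ones from the question, with backslashes doubled for a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDump {
    static final String REGEX =
        "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
        + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
        + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)([^<]*<)";

    static final String TEXT =
        "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";

    // Returns a matcher positioned on the first match of REGEX in TEXT.
    static Matcher firstMatch() {
        Matcher m = Pattern.compile(REGEX).matcher(TEXT);
        if (!m.find()) {
            throw new IllegalStateException("no match");
        }
        return m;
    }

    public static void main(String[] args) {
        Matcher m = firstMatch();
        // Groups that did not take part in the match print as null.
        for (int i = 1; i <= m.groupCount(); i++) {
            System.out.println("group " + i + " = " + m.group(i));
        }
    }
}
```

In the output, group 2 is 675-441-3144 and group 11 is ;;;78888464#<, while groups 9 and 10 (option 2) are null.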
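Do both of the following two things:
Convert the contents of group 2 from
option1|option2
to
option1(option2)?|option2
Update $11 in your replacement pattern so that it still refers to the last group. With this intermediate template the last group is temporarily number 14, because option 2 now appears twice and has gained a wrapping group; it becomes $12 once option 2 is trimmed in the next step.
This will greedily match one or both options, rather than only one option. The replacement pattern has to change because adding groups renumbers the trailing group.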
Now that we have modified the regex, our original "option 2" no longer makes sense. Given our new pattern template option1(option2)?|option2, it will be impossible for group 2 to match "675-441-3144;;;78888464#". This is because our original "option 1" will match all of "675-441-3144" and then stop. Our original "option 2" will then attempt to match ";;;78888464#", but will be unable to, because it begins with a mandatory capture group of 6-16 digits, (\d{6,16}), whereas ";;;78888464#" begins with a semicolon.
Convert the contents of our original "option 2" from
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?
to
([;,\.]{1,3}\d{3,}#?)?
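A quick check of this step, as a sketch: the class name Option2Check and the TAIL constant are made up for illustration, and the two patterns are the "from" and "to" forms of option 2 shown above.

```java
import java.util.regex.Pattern;

public class Option2Check {
    // The part of the input that option 1 leaves unmatched.
    static final String TAIL = ";;;78888464#";

    // Original option 2: must start with 6-16 digits, so it cannot start at ";".
    static boolean originalMatchesTail() {
        return Pattern.compile("(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?")
                      .matcher(TAIL)
                      .lookingAt();
    }

    // Trimmed option 2: the separator-plus-digits part on its own.
    static boolean trimmedMatchesTail() {
        return Pattern.compile("([;,\\.]{1,3}\\d{3,}#?)?")
                      .matcher(TAIL)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(originalMatchesTail()); // false
        System.out.println(trimmedMatchesTail());  // true
    }
}
```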
We have one final problem to solve. Now that our original "option 2" consists only of a single group with the ? quantifier, it is possible for it to successfully match a zero-length substring. So our pattern template option1(newoption2)?|newoption2 could result in a zero-length match, which does not fulfill the intended purpose of matching phone numbers.
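The zero-length problem is easy to reproduce in isolation. This is a small sketch with a hypothetical class name, ZeroLengthCheck, comparing the optional form of the group with the same expression without the ? quantifier.

```java
import java.util.regex.Pattern;

public class ZeroLengthCheck {
    // With the trailing "?", the whole expression may match nothing at all.
    static boolean optionalFormMatchesEmpty() {
        return Pattern.matches("([;,.]{1,3}\\d{3,}#?)?", "");
    }

    // Without it, at least one separator and three digits are required.
    static boolean mandatoryFormMatchesEmpty() {
        return Pattern.matches("[;,.]{1,3}\\d{3,}#?", "");
    }

    public static void main(String[] args) {
        System.out.println(optionalFormMatchesEmpty());  // true
        System.out.println(mandatoryFormMatchesEmpty()); // false
    }
}
```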
Do both of the following:
Convert the contents of our new "option 2" from
([;,.]{1,3}\d{3,}#?)?
to
[;,.]{1,3}\d{3,}#?
Change $12 in our replacement string to $10, since we have now removed one group in two locations.
Putting everything together, our final solution is as follows.
Search regex:
(>[^<]*?)((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})([;,\.]{1,3}\d{3,}#?)?|[;,\.]{1,3}\d{3,}#?)([^<]*<)
Replacement string:
$1<a href="tel:$2">$2</a>$10
Java:
final String searchRegex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)";
final String replacementRegex = "$1<a href=\"tel:$2\">$2</a>$10";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(searchRegex, replacementRegex);
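If renumbering the last group after every change feels fragile, the ${name} syntax from the Javadoc quoted earlier is an alternative. The following is only a sketch: the class name NamedGroupLinkifier and the group names lead, phone, and tail are my own choices, and the pattern is the final search regex above with its three top-level groups named.

```java
public class NamedGroupLinkifier {
    // Same pattern as the final search regex, with the three top-level
    // groups given (hypothetical) names so the replacement needs no counting.
    static final String SEARCH_REGEX =
        "(?<lead>>[^<]*?)"
        + "(?<phone>"
        + "(\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
        + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
        + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})"
        + "([;,\\.]{1,3}\\d{3,}#?)?"
        + "|[;,\\.]{1,3}\\d{3,}#?"
        + ")"
        + "(?<tail>[^<]*<)";

    static final String REPLACEMENT =
        "${lead}<a href=\"tel:${phone}\">${phone}</a>${tail}";

    static String linkify(String html) {
        return html.replaceAll(SEARCH_REGEX, REPLACEMENT);
    }

    public static void main(String[] args) {
        String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
        System.out.println(linkify(text));
    }
}
```

For the sample text this prints the result the question asks for, with the full 675-441-3144;;;78888464# inside the link.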
Upvotes: 3
Reputation: 111
Well, after trying to do it with replaceAll without success, I had to implement the replacement method myself:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String parsePhoneNumbers(String html) {
    // Build the pattern in pieces; backslashes are doubled for Java string literals.
    StringBuilder regex = new StringBuilder(120);
    regex.append("(>[^<]*?)(")
         .append("((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?")
         .append("(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?")
         .append("((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})")
         .append("([;,\\.]{1,3}\\d{3,}#?)?)")
         .append(")+([^<]*<)");
    StringBuilder mutableHtml = new StringBuilder(html.length());
    Pattern pattern = Pattern.compile(regex.toString());
    Matcher matcher = pattern.matcher(html);
    int start = 0;
    while (matcher.find()) {
        // Copy the text between the previous match and this one unchanged.
        mutableHtml.append(html.substring(start, matcher.start()));
        // group(groupCount()) is the trailing ([^<]*<) group, whatever its number.
        mutableHtml.append(matcher.group(1)).append("<a href=\"tel:")
                   .append(matcher.group(2)).append("\">").append(matcher.group(2))
                   .append("</a>").append(matcher.group(matcher.groupCount()));
        start = matcher.end();
    }
    // Copy whatever follows the last match.
    mutableHtml.append(html.substring(start));
    return mutableHtml.toString();
}
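For reference, the same manual loop can also be written with Matcher#appendReplacement and Matcher#appendTail, which take care of the substring bookkeeping. This is only a sketch: the class name AppendReplacementSketch, the linkify helper, and the toy pattern in main are assumptions for illustration, not the phone-number regex above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppendReplacementSketch {
    // Wraps group 2 of each match in a tel: link and keeps the first and last
    // groups around it, mirroring the loop above.
    static String linkify(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        StringBuffer out = new StringBuffer(html.length());
        while (m.find()) {
            String link = m.group(1) + "<a href=\"tel:" + m.group(2) + "\">"
                    + m.group(2) + "</a>" + m.group(m.groupCount());
            // quoteReplacement keeps any '$' or '\' in the text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(link));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Toy pattern with three groups, just to exercise the helper.
        Pattern toy = Pattern.compile("(>)(\\d+)(<)");
        System.out.println(linkify(toy, "<b>123</b>"));
        // <b><a href="tel:123">123</a></b>
    }
}
```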
Upvotes: -1