Kuri

Reputation: 111

Java regex backreference for two digits

I am working with a regex that I want to use with the replaceAll method of the String class in Java.

My regex works fine and groupCount() returns 11. But when I replace my text using a backreference to the eleventh group, I get the first group with a "1" appended instead of group eleven.

String regex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)([^<]*<)";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(regex, "$1<a href=\"tel:$2\">$2</a>$11");

I am expecting to get the following result:

<span style="font-size:11.0pt"><a href="tel:675-441-3144;;;78888464#">675-441-3144;;;78888464#</a><o:p></o:p></span>

But the $11 backreference does not return the 11th group. It returns the first group with a "1" appended, so I get the following result instead:

<span style="font-size:11.0pt"><a href="tel:675-441-3144">675-441-3144</a>>1o:p></o:p></span>

Can someone please tell me how to access the eleventh group of my pattern?

Thanks.

Upvotes: 0

Views: 1014

Answers (2)

Travis

Reputation: 2188

Short Answer

The way you access the eleventh group of a match in the replacement is with $11.

Explanation:

As the corresponding Javadoc* states:

The replacement string may contain references to subsequences captured during the previous match: Each occurrence of ${name} or $g will be replaced by the result of evaluating the corresponding group(name) or group(g) respectively. For $g, the first number after the $ is always treated as part of the group reference. Subsequent numbers are incorporated into g if they would form a legal group reference.

So generally speaking, as long as you have at least eleven groups, "$11" will evaluate to group(11). However, if you do not have at least eleven groups, "$11" will evaluate to group(1) + "1".

* This quote is from Matcher#appendReplacement(StringBuffer,String), which is where the chain of relevant citations from String#replaceAll(String,String) leads.
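The rule quoted above can be checked with two toy patterns. This is only a minimal sketch: the class name BackrefDemo and the sample patterns are made up for illustration and are not taken from the question.

```java
class BackrefDemo {
    // Eleven capturing groups: "$11" is a legal reference to group 11.
    static String withElevenGroups() {
        return "abcdefghijk".replaceAll("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)", "$11");
    }

    // Only two capturing groups: "$11" is read as "$1" followed by a literal "1".
    static String withTwoGroups() {
        return "ab".replaceAll("(a)(b)", "$11");
    }

    public static void main(String[] args) {
        System.out.println(withElevenGroups()); // k
        System.out.println(withTwoGroups());    // a1
    }
}
```

The second digit is only absorbed into the reference when a group with that number exists, which is why the same replacement string behaves differently for the two patterns.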


Actual Answer

Your regex does not do what you think it does.

Part 1

The Problem

Let's divide your regex into its three top-level groups. These are groups 1, 2, and 11, respectively.

  • Group 1:
    (>[^<]*?)
  • Group 2:
    ((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
  • Group 11:
    ([^<]*<)

Group 2 is the main body of your regex, and it consists of a top-level alternation over two options. These two options consist of groups 3-8 and 9-10, respectively.

  • First option:
    (\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})
  • Second option:
    (\d{6,16})([;,\.]{1,3}\d{3,}#?)?

Now, given the text string, here is what is going on:

  1. Group 1 executes. It matches the first ">".
  2. Group 2 executes. It evaluates the options of its alternation in order.
    1. The first option of group 2's alternation executes. It matches "675-441-3144".
    2. Group 2's alternation successfully short-circuits upon the match of one of its options.
      • Group 2 as a whole is now equal to the option that matched, which is "675-441-3144".
      • The cursor is now positioned immediately after "675-441-3144", which is immediately before ";;;78888464#".
  3. Group 11 executes. It matches everything up through the next "<", which is all of ";;;78888464#<".

Thus, some of the content that you want to be in group 2 is actually in group 11 instead.
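The walkthrough above can be reproduced by printing the three top-level groups of the original pattern. This is a sketch under the assumption that the regex and text are exactly the ones in the question; the class name GroupTrace and its method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupTrace {
    // The question's regex, with backslashes doubled for a Java string literal.
    static final String REGEX =
        "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
        + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
        + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)([^<]*<)";

    static final String TEXT =
        "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";

    // Returns groups 1, 2 and 11 of the first match, or null if nothing matches.
    static String[] topLevelGroups() {
        Matcher m = Pattern.compile(REGEX).matcher(TEXT);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(11) };
    }

    public static void main(String[] args) {
        String[] g = topLevelGroups();
        System.out.println("group 1:  " + g[0]); // >
        System.out.println("group 2:  " + g[1]); // 675-441-3144
        System.out.println("group 11: " + g[2]); // ;;;78888464#<
    }
}
```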

The Solution

Do both of the following two things:

  • Convert the contents of group 2 from

    option1|option2
    

    to

    option1(option2)?|option2
    
  • Change $11 in your replacement pattern to $12.

This will greedily match one or both options, rather than only one option. The replacement pattern changes because we have added a group.

Part 2

The Problem

Now that we have modified the regex, our original "option 2" no longer makes sense. Given our new pattern template option1(option2)?|option2, it will be impossible for group 2 to match "675-441-3144;;;78888464#". This is because our original "option 1" will match all of "675-441-3144" and then stop. Our original "option 2" will then attempt to match ";;;78888464#", but will be unable to because it begins with a mandatory capture group of 6-16 digits, (\d{6,16}), whereas ";;;78888464#" begins with a semicolon.

The Solution

Convert the contents of our original "option 2" from

(\d{6,16})([;,\.]{1,3}\d{3,}#?)?

to

([;,\.]{1,3}\d{3,}#?)?
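The effect of this change can be checked with Matcher#lookingAt, which only matches at the start of the input. The following is a rough sketch; the class OptionTwoCheck and the constant REST (the text left over after "675-441-3144") are names I made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionTwoCheck {
    // What remains of the phone number once "option 1" has matched.
    static final String REST = ";;;78888464#";

    // Original "option 2": must start with 6-16 digits, so it cannot start at ';'.
    static boolean originalOptionMatches() {
        return Pattern.compile("(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?")
                      .matcher(REST)
                      .lookingAt();
    }

    // Trimmed "option 2": returns the text matched at the start of REST, or null.
    static String trimmedOptionMatch() {
        Matcher m = Pattern.compile("([;,\\.]{1,3}\\d{3,}#?)?").matcher(REST);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(originalOptionMatches()); // false
        System.out.println(trimmedOptionMatch());    // ;;;78888464#
    }
}
```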

Part 3

The Problem

We have one final problem to solve. Now that our original "option 2" consists only of a single group with the ? quantifier, it is possible for it to successfully match a zero-length substring. So our pattern template option1(newoption2)?|newoption2 could result in a zero-length match, which does not fulfill the intended purpose of matching phone numbers.
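The zero-length match is easy to observe in isolation. A small sketch, assuming an arbitrary input with no phone-number suffix in it (the class name and the sample input are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthCheck {
    // An entirely optional pattern: it succeeds even when nothing is consumed.
    static final Pattern OPTIONAL = Pattern.compile("([;,\\.]{1,3}\\d{3,}#?)?");

    // Length of the first match in the input, or -1 if there is no match at all.
    static int firstMatchLength(String input) {
        Matcher m = OPTIONAL.matcher(input);
        return m.find() ? m.end() - m.start() : -1;
    }

    // True if the optional group did not take part in the first match.
    static boolean groupIsNull(String input) {
        Matcher m = OPTIONAL.matcher(input);
        return m.find() && m.group(1) == null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatchLength("no suffix here")); // 0
        System.out.println(groupIsNull("no suffix here"));      // true
    }
}
```

The match succeeds at index 0 with length 0, and group 1 is null because the optional group never participated.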

The Solution

Do both of the following:

  • Convert the contents of our new "option 2" from

    ([;,\.]{1,3}\d{3,}#?)?

    to

    [;,\.]{1,3}\d{3,}#?

  • Change $12 in our replacement string to $10, since we have now removed one group in two locations.


The Final Solution

Putting everything together, our final solution is as follows.

Search regex:

(>[^<]*?)((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})([;,\.]{1,3}\d{3,}#?)?|[;,\.]{1,3}\d{3,}#?)([^<]*<)

Replacement string:

$1<a href="tel:$2">$2</a>$10

Java:

final String searchRegex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)";
final String replacementRegex = "$1<a href=\"tel:$2\">$2</a>$10";

String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(searchRegex, replacementRegex);
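Since the group number had to be adjusted every time the pattern changed, one possible alternative is to name the three top-level groups and use the ${name} form that the Javadoc quote mentions. This is only a sketch: the names lead, number and tail, and the class NamedGroupsDemo, are my own choices and not part of the original code.

```java
class NamedGroupsDemo {
    // Same pattern as the final solution, with the three top-level groups named.
    static final String SEARCH_REGEX =
        "(?<lead>>[^<]*?)"
        + "(?<number>(\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
        + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
        + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?"
        + "|[;,\\.]{1,3}\\d{3,}#?)"
        + "(?<tail>[^<]*<)";

    // Named references stay valid even if inner groups are added or removed.
    static final String REPLACEMENT =
        "${lead}<a href=\"tel:${number}\">${number}</a>${tail}";

    static String linkify(String html) {
        return html.replaceAll(SEARCH_REGEX, REPLACEMENT);
    }

    public static void main(String[] args) {
        String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
        System.out.println(linkify(text));
    }
}
```

For the sample text from the question this produces the string the asker expected, with the whole "675-441-3144;;;78888464#" inside the link.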

Proof of correctness

Upvotes: 3

Kuri

Reputation: 111

Well, after trying to do it with replaceAll without success, I had to implement the replacement method myself:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String parsePhoneNumbers(String html){
    // Build the pattern in pieces; backslashes are doubled for Java string literals.
    StringBuilder regex = new StringBuilder(120);
    regex.append("(>[^<]*?)(")
       .append("((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?")
       .append("(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?")
       .append("((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})")
       .append("([;,\\.]{1,3}\\d{3,}#?)?)")
       .append(")+([^<]*<)");

    StringBuilder mutableHtml = new StringBuilder(html.length());
    Pattern pattern = Pattern.compile(regex.toString());
    Matcher matcher = pattern.matcher(html);
    int start = 0;

    while(matcher.find()){
        // Copy the text between the previous match and this one unchanged.
        mutableHtml.append(html.substring(start, matcher.start()));
        // Note: group(2) is a repeated group, so it holds only its last iteration.
        mutableHtml.append(matcher.group(1)).append("<a href=\"tel:")
                .append(matcher.group(2)).append("\">").append(matcher.group(2))
                .append("</a>").append(matcher.group(matcher.groupCount()));
        start = matcher.end();
    }
    // Copy whatever follows the last match.
    mutableHtml.append(html.substring(start));
    return mutableHtml.toString();
}

Upvotes: -1
