xtian

Reputation: 3169

Using Matcher.appendReplacement() with multiple regions

Java's Matcher.appendReplacement() method (together with appendTail()) is supposed to let me transform a source text into a result text while replacing all occurrences of a pattern. In pseudocode, the algorithm would be something like:

while Matcher.find() {
  call Matcher.appendReplacement()
}
call Matcher.appendTail()

If the pattern is searched only inside a given region, all is fine:

call Matcher.region()
while Matcher.find() {
  call Matcher.appendReplacement()
}
call Matcher.appendTail()

The problem arises when, after matching inside a region, I want to move the region further:

call Matcher.region()
while Matcher.find() {
  call Matcher.appendReplacement()
}
call Matcher.region()
while Matcher.find() {
  call Matcher.appendReplacement()
}
call Matcher.appendTail()

This doesn't work because region() resets the matcher, which sets its append position back to zero. The next appendReplacement() call therefore copies from the beginning of the text again, so part of the source is duplicated in the result.

This is by design, as the Javadoc for region() and reset() says.

What is the correct way of replacing a pattern that can be located inside more than one region?

Edit: Java example added, text example removed

The following Java example shows that, from an input like

dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5

you don't get the expected output

dog1 start cat2a cat2b end dog3 start cat4a cat4b end dog5

package test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestMatcher {

    public static void main(String[] args) throws Exception {
        String inputText = "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5";
        System.out.println("input  = " + inputText);
        StringBuffer result = new StringBuffer();
        Pattern pattern = Pattern.compile("dog");
        Matcher matcher = pattern.matcher(inputText);

        int startPos = inputText.indexOf("start");
        int endPos = inputText.indexOf("end");
        System.out.println("Setting region to " + startPos + "," + endPos);
        matcher.region(startPos, endPos);
        while (matcher.find()) {
            matcher.appendReplacement(result, "cat");
        }
        System.out.println("Partial result = " + result);

        startPos = inputText.indexOf("start", endPos);
        endPos = inputText.indexOf("end", startPos);
        System.out.println("Setting region to " + startPos + "," + endPos);
        matcher.region(startPos, endPos);
        while (matcher.find()) {
            matcher.appendReplacement(result, "cat");
        }
        matcher.appendTail(result);
        System.out.println("Final result   = " + result);
    }
}

Output:

input  = dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5
Setting region to 5,23
Partial result = dog1 start cat2a cat
Setting region to 32,50
Final result   = dog1 start cat2a catdog1 start dog2a dog2b end dog3 start cat4a cat4b end dog5

Upvotes: 3

Views: 3078

Answers (1)

ankhzet

Reputation: 2568

Shouldn't sub-regions be handled by a separate matcher? For example:

public static void main(String[] args) {
  String inputText = "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5";

  System.out.println("Input          = " + inputText);
  StringBuffer result = new StringBuffer();
  Pattern pattern = Pattern.compile("(start(.*?)end)");

  Matcher matcher = pattern.matcher(inputText);

  while (matcher.find()) {
    int s = matcher.start();
    int e = matcher.end();
    System.out.printf("(%d .. %d) -> \"%s\"\n", s, e, matcher.group(1));
    // quoteReplacement keeps any '$' or '\' in the processed text literal
    matcher.appendReplacement(result,
        Matcher.quoteReplacement(processSubGroup(matcher.group(1), matcher.group(2))));
  }
  matcher.appendTail(result);
  System.out.println("Final result   = " + result);
}

static String processSubGroup(String subGroup, String contents) {
  StringBuffer result = new StringBuffer();
  Pattern pattern = Pattern.compile("dog");

  Matcher matcher = pattern.matcher(subGroup);

  while (matcher.find())
    matcher.appendReplacement(result, "cat");

  matcher.appendTail(result);
  return result.toString();
}

Or, more simply, without the logging:

public static void main(String[] args) {
  String inputText = "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5";

  StringBuffer result = new StringBuffer();
  Pattern pattern = Pattern.compile("(start(.*?)end)");

  Matcher matcher = pattern.matcher(inputText);

  while (matcher.find())
    matcher.appendReplacement(result,
        Matcher.quoteReplacement(processSubGroup(matcher.group(1), matcher.group(2))));

  matcher.appendTail(result);
  System.out.println("Final result   = " + result);
}

static String processSubGroup(String subGroup, String contents) {
  return Pattern.compile("dog").matcher(subGroup).replaceAll("cat");
}

Result (from the first version, with logging):

Input          = dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5
(5 .. 26) -> "start dog2a dog2b end"
(32 .. 53) -> "start dog4a dog4b end"
Final result   = dog1 start cat2a cat2b end dog3 start cat4a cat4b end dog5
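On Java 9 or later, the same nested replacement can probably be written without the explicit find()/appendReplacement() loop, using the Matcher.replaceAll(Function) overload. This is only a sketch under that version assumption; the class name BlockReplace and the method name replaceInBlocks are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; requires Java 9+.
final class BlockReplace {
    private static final Pattern BLOCK = Pattern.compile("start.*?end");
    private static final Pattern DOG = Pattern.compile("dog");

    // Rewrites "dog" to "cat" only inside start ... end blocks.
    static String replaceInBlocks(String input) {
        return BLOCK.matcher(input).replaceAll(block ->
                // The function result is treated as a replacement string,
                // so quote it to keep any '$' or '\' literal.
                Matcher.quoteReplacement(DOG.matcher(block.group()).replaceAll("cat")));
    }

    public static void main(String[] args) {
        System.out.println(replaceInBlocks(
                "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5"));
    }
}
```

The outer matcher finds each block and the inner matcher only ever sees that block's text, so no region() call is needed.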

Or a more abstract approach:

interface GroupProcessor {
  String process(String group);
}

public static void main(String[] args) {
  String inputText = "dog1 dogs dog2a dog2b enddogs cow1 dog3 cows cow2a cow2b endcows dog4 dogs dog5a dog5b enddogs cow3";

  String result = inputText;

  result = processGroup(result, "dogs*enddogs", (group) -> {
    return Pattern.compile("dog").matcher(group).replaceAll("cat");
  });

  result = processGroup(result, "cows*endcows", (group) -> {
    return Pattern.compile("cow").matcher(group).replaceAll("sheep");
  });

  System.out.println("Input        = " + inputText);
  System.out.println("Final result = " + result);
}

static String processGroup(String input, String regex, GroupProcessor processor) {
  StringBuffer result = new StringBuffer();
  Pattern pattern = Pattern.compile(String.format("(%s)", regex.replace("*", "(.*?)")));

  Matcher matcher = pattern.matcher(input);

  while (matcher.find())
    matcher.appendReplacement(result,
        Matcher.quoteReplacement(processor.process(matcher.group(1))));

  matcher.appendTail(result);
  return result.toString();
}

Which will give us:

Input        = dog1 dogs dog2a dog2b enddogs cow1 dog3 cows cow2a cow2b endcows dog4 dogs dog5a dog5b enddogs cow3
Final result = dog1 cats cat2a cat2b endcats cow1 dog3 sheeps sheep2a sheep2b endsheeps dog4 cats cat5a cat5b endcats cow3

Upd.

Here is why Matcher.region() resets the matcher's implicit state and, with it, lastAppendPosition.

appendReplacement() and appendTail() form a forward-only mechanism, whereas region() can be set to any part of the input, including one that lies before text that has already been appended.

Assume the following situation: for a string of 100 chars you apply the region 0..20 and run the find()/appendReplacement() loop, then move the region to, say, 30..60 and run the replacement loop again.

Now you have the 0..100 source string and, in the StringBuffer, a result that covers roughly 0..60 of the source.

Next, you apply the region 10..40 to the source string. What now? If that region contains no matches, fine, nothing happens. But if it does contain matches, where should appendReplacement() put the replacement results? The result string is already past the 10..40 region, and appendReplacement() only appends; it cannot replace parts of the text already in the output buffer.

If some constraint limited the region to something like MAX(start, lastAppendPosition)..MIN(end, sourceLength), the append mechanism would work fine. But region() has no such limitation, and adding one would make region() much less useful for searching, which is its main purpose.

That's why region() resets the matcher's implicit state, which makes it of little use together with appendReplacement() and appendTail(). If you need different behavior, wrap the Matcher in your own class (Matcher is final, so it cannot be subclassed).
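One way such a wrapper might look is sketched below. It uses the Matcher only for find() and keeps its own append position, so moving the region does not rewind the output. The class RegionReplacer and its methods are hypothetical names, not part of java.util.regex, and the sketch assumes that regions are visited left to right and that the replacement is a literal string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper, not part of java.util.regex.
final class RegionReplacer {
    private final CharSequence input;
    private final Matcher matcher;
    private final StringBuilder out = new StringBuilder();
    private int appendPos = 0; // our own append position; region() cannot reset it

    RegionReplacer(Pattern pattern, CharSequence input) {
        this.input = input;
        this.matcher = pattern.matcher(input);
    }

    // Replaces every match inside [start, end) with the literal replacement.
    // Assumption: regions are passed in ascending order.
    RegionReplacer replaceInRegion(int start, int end, String replacement) {
        if (start < appendPos) {
            throw new IllegalArgumentException("region starts before text already appended");
        }
        matcher.region(start, end); // resets the Matcher, which is fine: it is only used for find()
        while (matcher.find()) {
            out.append(input, appendPos, matcher.start()); // unmatched text since the last append
            out.append(replacement);
            appendPos = matcher.end();
        }
        return this;
    }

    // Appends whatever follows the last replacement and returns the result.
    String finish() {
        out.append(input, appendPos, input.length());
        appendPos = input.length();
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "dog1 start dog2a dog2b end dog3 start dog4a dog4b end dog5";
        RegionReplacer replacer = new RegionReplacer(Pattern.compile("dog"), text);
        int start = text.indexOf("start");
        while (start >= 0) {
            int end = text.indexOf("end", start);
            if (end < 0) break;
            replacer.replaceInRegion(start, end, "cat");
            start = text.indexOf("start", end);
        }
        System.out.println(replacer.finish());
    }
}
```

The guard in replaceInRegion() is the MAX(start, lastAppendPosition) style constraint described above, enforced by the wrapper rather than by Matcher itself.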

Upvotes: 1
