Ricardo

Reputation: 1411

Avoid overlapping regex matching in Java

For some reason this piece of Java code is giving me overlapping matches:

Pattern pat = Pattern.compile("(" + leftContext + ")" + ".*" + "(" + rightContext + ")", Pattern.DOTALL);

Is there any way or option to avoid detecting overlaps? E.g. leftContext rightContext rightContext should be 1 match instead of 2.

Here's the complete code:

public static String replaceWithContext(String input, String leftContext, String rightContext, String newString) {
  Pattern pat = Pattern.compile("(" + leftContext + ")" + ".*" + "(" + rightContext + ")", Pattern.DOTALL);
  Matcher matcher = pat.matcher(input);
  StringBuffer buffer = new StringBuffer();

  while (matcher.find()) {
    matcher.appendReplacement(buffer, "");
    buffer.append(matcher.group(1) + newString + matcher.group(2));
  }
  matcher.appendTail(buffer);

  return buffer.toString();
}

So here's the final answer, using a negative lookahead. My bad for not realizing that * is greedy:

Pattern pat = Pattern.compile("(" +
    leftContext + ")" + "(?:(?!" +
    rightContext + ").)*" + "(" +
    rightContext + ")", Pattern.DOTALL);
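
To make the difference concrete, here is a minimal sketch that counts matches for the original greedy pattern and for the lookahead version. The class name, the helper methods, and the `<<` / `>>` contexts are assumptions made up for illustration; they contain no regex metacharacters, so nothing needs escaping in this particular case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsTempered {
    // Counts how many times the compiled pattern matches in the input.
    public static int count(Pattern pat, String input) {
        Matcher m = pat.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Original pattern: greedy .* runs from the first left context
    // to the last right context in the input.
    public static int countGreedy(String input, String left, String right) {
        Pattern pat = Pattern.compile(
                "(" + left + ")" + ".*" + "(" + right + ")", Pattern.DOTALL);
        return count(pat, input);
    }

    // Revised pattern: (?:(?!right).)* consumes a character only if the
    // right context does not start at that position.
    public static int countTempered(String input, String left, String right) {
        Pattern pat = Pattern.compile(
                "(" + left + ")" + "(?:(?!" + right + ").)*" + "(" + right + ")",
                Pattern.DOTALL);
        return count(pat, input);
    }

    public static void main(String[] args) {
        // Hypothetical input with two separate left/right pairs.
        String input = "<<a>> middle <<b>>";
        System.out.println(countGreedy(input, "<<", ">>"));   // prints 1
        System.out.println(countTempered(input, "<<", ">>")); // prints 2
    }
}
```

The greedy version reports one match spanning the whole input, while the lookahead version reports each pair separately.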

Upvotes: 4

Views: 1215

Answers (2)

Alan Moore

Reputation: 75222

Your use of the word "overlapping" is confusing. Apparently, what you meant was that the regex is too greedy, matching everything from the first leftContext to the last rightContext. It seems you figured that out already--and came up with a better approach as well--but there's still at least one potential problem.

You said leftContext and rightContext are "plain Strings", by which I assume you meant they aren't supposed to be interpreted as regexes, but they will be. You need to escape them, or any regex metacharacters they contain will cause incorrect results or run-time exceptions. The same goes for your replacement string, although only $ and the backslash have special meanings there. Here's an example (notice the non-greedy .*?, too):

public static String replaceWithContext(String input, String leftContext, String rightContext, String newString){
  String lcRegex = Pattern.quote(leftContext);
  String rcRegex = Pattern.quote(rightContext);
  String replace = Matcher.quoteReplacement(newString);
  Pattern pat = Pattern.compile("(" + lcRegex + ").*?(" + rcRegex + ")", Pattern.DOTALL);

One other thing: if you aren't doing any post-match processing on the matched text, you can use replaceAll instead of rolling your own with appendReplacement and appendTail:

return input.replaceAll("(?s)(" + lcRegex + ")" +
                        "(?:(?!" + rcRegex + ").)*" +
                        "(" + rcRegex + ")",
    "$1" + replace + "$2");
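
Put together, the pieces above might look like the following self-contained sketch. The wrapper class name and the sample input are assumptions for illustration only; the contexts and the replacement were picked deliberately so that they contain characters that are special in a regex (`(` and `)`) and in a replacement string (`$`).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextReplacer {
    public static String replaceWithContext(String input, String leftContext,
                                            String rightContext, String newString) {
        // Quote the contexts so they are matched literally, not as regexes.
        String lcRegex = Pattern.quote(leftContext);
        String rcRegex = Pattern.quote(rightContext);
        // Quote the replacement so $ and \ are inserted literally.
        String replace = Matcher.quoteReplacement(newString);

        // (?s) is the inline form of Pattern.DOTALL.
        return input.replaceAll("(?s)(" + lcRegex + ")"
                + "(?:(?!" + rcRegex + ").)*"
                + "(" + rcRegex + ")",
                "$1" + replace + "$2");
    }

    public static void main(String[] args) {
        // Hypothetical input: parentheses as contexts, "$1" as literal replacement text.
        System.out.println(replaceWithContext("f(a) + g(b)", "(", ")", "$1"));
        // prints f($1) + g($1)
    }
}
```

Without the two `Pattern.quote` calls, a left context of `(` would produce an unbalanced group and `Pattern.compile` would throw a `PatternSyntaxException`; without `Matcher.quoteReplacement`, the `$1` in the new string would be read as a group reference.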

Upvotes: 2

darioo

Reputation: 47183

There are a few possibilities, depending on what you really need.

You can append $ at the end of your regex, like this:

"(" + leftContext + ")" + ".*" + "(" + rightContext + ")$"

so if rightContext isn't the last thing, your regex won't match.
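
As a rough sketch of how the anchored variant behaves (the class name, method name, and the `<<` / `>>` contexts are made up for illustration, and contain no regex metacharacters):

```java
import java.util.regex.Pattern;

public class EndAnchorDemo {
    // Returns true only if some left context is followed, later on,
    // by a right context that sits at the very end of the input.
    public static boolean matchesUpToEnd(String input, String left, String right) {
        Pattern pat = Pattern.compile(
                "(" + left + ")" + ".*" + "(" + right + ")$", Pattern.DOTALL);
        return pat.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesUpToEnd("<<a>>", "<<", ">>"));      // prints true
        System.out.println(matchesUpToEnd("<<a>> tail", "<<", ">>")); // prints false
    }
}
```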

Next, you can capture everything after rightContext:

"(" + leftContext + ")" + ".*" + "(" + rightContext + ")(.*)"

and after that discard everything in your third capturing group.

But, since we don't know what leftContext and rightContext really are, maybe your problem lies within them.

Upvotes: 1
