Mene

Reputation: 3799

Replacing repeatedly occurring groups of an anchored regex in Java

Using Java 7 and the default regex implementation in java.util.regex.Pattern, given a regex like this:

^start (m[aei]ddel[0-9] ?)+ tail$

And a string like this:

start maddel1 meddel2 middel3 tail

Is it possible to get an output like this using the anchored regex:

start <match> <match> <match> tail.

I can get every group without anchors like this:

Regex: m[aei]ddel[0-9]

StringBuffer sb = new StringBuffer();
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    matcher.appendReplacement(sb, Matcher.quoteReplacement("<middle>"));
}
matcher.appendTail(sb); // copy the remainder of the input after the last match

The problem is that I'm working on a quite big dataset and being able to anchor the patterns would be a huge performance win.

However, when I add the anchors, the only API I can find requires a whole match and gives access only to the last occurrence of the group. In my case I need to verify that the regex actually matches (i.e. a whole match), but in the replacement step I need to be able to access every group on its own.

Edit: I'd like to avoid workarounds like looking for the anchors in a separate step, because that would require bigger changes to the code, and wrapping it all up in regexes feels more elegant.
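
To make the limitation concrete, here is a minimal sketch showing that after a whole match, group(1) holds only the final repetition of the group. The class and method names (LastGroupOnly, lastRepetition) are hypothetical and exist only for illustration; the regex is the anchored one from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is used.
final class LastGroupOnly {
    // The anchored regex from the question.
    static final Pattern ANCHORED =
            Pattern.compile("^start (m[aei]ddel[0-9] ?)+ tail$");

    // Returns what group(1) holds after a whole match,
    // or null if the input does not match at all.
    static String lastRepetition(String input) {
        Matcher m = ANCHORED.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Only the last repetition survives in the capture group.
        System.out.println(lastRepetition("start maddel1 meddel2 middel3 tail")); // middel3
    }
}
```

The earlier repetitions ("maddel1", "meddel2") are overwritten as the quantifier iterates, which is why a single matches() call cannot drive a per-group replacement.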

Upvotes: 3

Views: 82

Answers (3)

Casimir et Hippolyte

Reputation: 89565

Using the \G anchor with the find method, you can write it this way:

String pat = "\\G(?:(?!\\A) |\\Astart (?=(?:m[aei]ddel[0-9] )+tail\\z))(m\\S+)";

details:

\\G # position after the previous match or at the start of the string
    # factoring it out of the alternation makes the pattern fail faster after the last match
(?:
    (?!\\A) [ ] # a space not at the start of the string
                # this branch comes first because it is the more likely to succeed
  |
    \\A start [ ] # "start " at the beginning of the string
    (?=(?:m[aei]ddel[0-9] )+tail\\z) # check the string format once and for all
                                     # since this branch will succeed only once
)
( # capture group 1
    m\\S+ # the shortest and simplest pattern that matches "m[aei]ddel[0-9]"
          # and excludes "tail" (adapt it to your needs, but keep the same idea)
)
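
One way to drive the replacement with this pattern is sketched below. Each match also consumes "start " or the separating space, so the sketch copies everything up to the start of group 1 and swaps only the captured span. The class name GAnchorReplace, the method replaceMiddles, and the token parameter are assumptions made for illustration, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; works on Java 7 (no StringBuilder-based appendReplacement needed).
final class GAnchorReplace {
    static final Pattern PAT = Pattern.compile(
            "\\G(?:(?!\\A) |\\Astart (?=(?:m[aei]ddel[0-9] )+tail\\z))(m\\S+)");

    // Replaces every captured middle group with the given token.
    // Input that does not have the anchored format produces no match
    // and is returned unchanged.
    static String replaceMiddles(String input, String token) {
        Matcher m = PAT.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start(1)); // keep "start " or the separating space
            out.append(token);                   // substitute only the captured group
            last = m.end(1);
        }
        out.append(input, last, input.length()); // copy the rest, e.g. " tail"
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceMiddles("start maddel1 meddel2 middel3 tail", "<match>"));
        // start <match> <match> <match> tail
    }
}
```

Because the lookahead sits in the branch that can only succeed at the start of the string, the format check runs once, and the later matches only need a space followed by the group.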


Upvotes: 2

Wiktor Stribiżew

Reputation: 626952

To do it in one step, you need to use a \G based regex that will do the anchoring. However, you also need a positive lookahead to check if the string ends with the desired pattern.

Here is a regex that should work:

(^start|(?!\A)\G)\s+m[aei]ddel[0-9](?=(?:\s+m[aei]ddel[0-9])*\s+tail$)


String s = "start maddel1 meddel2 middel3 tail";
String pat = "(^start|(?!\\A)\\G)\\s+(m[aei]ddel[0-9])(?=(?:\\s+m[aei]ddel[0-9])*\\s+tail$)";
System.out.println(s.replaceAll(pat, "$1 <middle>")); // start <middle> <middle> <middle> tail


Explanation:

  • (^start|(?!\A)\G) - match start at the beginning of the string, or match at the end of the previous successful match
  • \s+ - 1 or more whitespaces
  • m[aei]ddel[0-9] - m, then either a, e, i, then ddel, then 1 digit
  • (?=(?:\s+m[aei]ddel[0-9])*\s+tail$) - only if followed by:
    • (?:\s+m[aei]ddel[0-9])* - zero or more sequences of 1+ whitespaces and middelN pattern
    • \s+ - 1 or more whitespaces
    • tail$ - the tail substring followed by the end of the string.
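
As a rough check of the breakdown above, the sketch below applies the same pattern to one well-formed string and two malformed ones. The class name AnchoredReplaceCheck, the method replaceMiddles, and the malformed sample strings are assumptions added for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the pattern from this answer.
final class AnchoredReplaceCheck {
    static final Pattern PAT = Pattern.compile(
            "(^start|(?!\\A)\\G)\\s+(m[aei]ddel[0-9])(?=(?:\\s+m[aei]ddel[0-9])*\\s+tail$)");

    // $1 is "start" for the first match and empty for the \G-chained ones.
    static String replaceMiddles(String input) {
        return PAT.matcher(input).replaceAll("$1 <middle>");
    }

    public static void main(String[] args) {
        // Well-formed input: every middle group is replaced.
        System.out.println(replaceMiddles("start maddel1 meddel2 middel3 tail"));
        // Malformed middle: the lookahead fails on the first attempt, so nothing is replaced.
        System.out.println(replaceMiddles("start maddel1 oops middel3 tail"));
        // Wrong prefix: neither ^start nor \G can begin a match.
        System.out.println(replaceMiddles("begin maddel1 tail"));
    }
}
```

Note that the lookahead here is evaluated again for every match, so it rescans the remaining groups each time. For strings with many groups that may matter for the performance goal stated in the question.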

Upvotes: 2

anubhava

Reputation: 785316

You can use \G for this:

final String regex = "(^start |(?<!^)\\G)m[aei]ddel[0-9] (?=(?:.* )?tail$)";
final String str = "start maddel1 meddel2 middel3 tail";

String repl = str.replaceAll(regex, "$1<match> ");
//=> start <match> <match> <match> tail


\G asserts position at the end of the previous match or the start of the string for the first match.
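
A small sketch of that behaviour, independent of the regex above: with \G, each find() must begin exactly where the previous match ended, so the chain stops at the first gap. The class name GAnchorDemo, the helper countMatches, and the sample input are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating how \G chains matches.
final class GAnchorDemo {
    // Counts how many times find() succeeds for the given regex and input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Without \G every digit is found, even after the 'a'.
        System.out.println(countMatches("\\d", "123a45"));    // 5
        // With \G the chain breaks at 'a', so only 1, 2 and 3 match.
        System.out.println(countMatches("\\G\\d", "123a45")); // 3
    }
}
```

This chaining is what lets the first alternative anchor at the start of the string while the second alternative continues only from the end of the previous match.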

Upvotes: 3
