Reputation: 3799
Using Java 7 and the default regex implementation in java.util.regex.Pattern, given a regex like this:
^start (m[aei]ddel[0-9] ?)+ tail$
And a string like this:
start maddel1 meddel2 middel3 tail
Is it possible to get output like this using the anchored regex?
start <match> <match> <match> tail
I can get every group without anchors like this:
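Regex: m[aei]ddel[0-9]
Pattern pattern = Pattern.compile("m[aei]ddel[0-9]");
StringBuffer sb = new StringBuffer();
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    // quoteReplacement escapes any '$' or '\' in the replacement text
    matcher.appendReplacement(sb, Matcher.quoteReplacement("<middle>"));
}
matcher.appendTail(sb); // copy the text after the last match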
The problem is that I'm working on quite a big dataset, and being able to anchor the patterns would be a huge performance win.
However, when I add the anchors, the only API I can find requires a whole match and only gives access to the last occurrence of the group. In my case I need to verify that the regex actually matches (i.e. a whole match), but in the replacement step I need to be able to access every group on its own.
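To illustrate the limitation described here, a minimal sketch (the class and method names are made up for this example; only the regex and the input come from the question): with the anchored pattern, matches() confirms the whole-string match, but group(1) exposes only the last repetition of the repeated group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex and the sample input come from the question.
class LastGroupOnlyDemo {
    private static final Pattern ANCHORED =
            Pattern.compile("^start (m[aei]ddel[0-9] ?)+ tail$");

    // Returns what group(1) holds after a successful whole-string match, or null.
    static String lastCapturedGroup(String input) {
        Matcher m = ANCHORED.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Only the final repetition of the group survives: prints "middel3".
        System.out.println(lastCapturedGroup("start maddel1 meddel2 middel3 tail"));
    }
}
```

The earlier repetitions (maddel1, meddel2) are matched but no longer reachable through the group API, which is why a plain anchored match is not enough for the replacement step.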
Edit: I'd like to avoid workarounds like looking for the anchors in a separate step, because that would require bigger changes to the code, and wrapping it all up in regexes feels more elegant.
Upvotes: 3
Views: 82
Reputation: 89565
With the \G anchor, for the find method, you can write it this way:
pat = "\\G(?:(?!\\A) |\\Astart (?=(?:m[aei]ddel[0-9] )+tail\\z))(m\\S+)";
details:
\\G # position after the previous match or at the start of the string
    # factoring it out makes the pattern fail faster after the last match
(?:
    (?!\\A) [ ] # a space not at the start of the string
                # this branch comes first because it is more likely to succeed
  |
    \\A start [ ] # "start " at the beginning of the string
    (?=(?:m[aei]ddel[0-9][ ])+tail\\z) # check the string format once and for all,
                                       # since this branch will succeed only once
)
( # capture group 1
    m\\S+ # the shortest and simplest pattern that matches "m[aei]ddel[0-9]"
          # and excludes "tail" (adapt it to your needs but keep the same idea)
)
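The answer shows only the pattern, so here is one possible way to use it for the replacement, as a sketch rather than the author's own code. Because each overall match also contains the leading "start " or space, this version replaces only the span of capture group 1. The class name, method name and the "<match>" placeholder are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the pattern from this answer.
class GAnchorFindDemo {
    private static final Pattern PAT = Pattern.compile(
            "\\G(?:(?!\\A) |\\Astart (?=(?:m[aei]ddel[0-9] )+tail\\z))(m\\S+)");

    // Replaces every group-1 span with "<match>"; input that fails the
    // lookahead check produces no matches and is returned unchanged.
    static String replaceMiddles(String input) {
        Matcher m = PAT.matcher(input);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(input, last, m.start(1)); // keep "start " or the separating space
            sb.append("<match>");
            last = m.end(1);
        }
        sb.append(input, last, input.length()); // keep the remainder, e.g. " tail"
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "start <match> <match> <match> tail"
        System.out.println(replaceMiddles("start maddel1 meddel2 middel3 tail"));
    }
}
```

Copying the text between group spans by hand avoids having to rebuild the prefix in the replacement string.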
Upvotes: 2
Reputation: 626952
To do it in one step, you need to use a \G-based regex that will do the anchoring. However, you also need a positive lookahead to check if the string ends with the desired pattern.
Here is a regex that should work:
(^start|(?!\A)\G)\s+m[aei]ddel[0-9](?=(?:\s+m[aei]ddel[0-9])*\s+tail$)
See the regex demo
String s = "start maddel1 meddel2 middel3 tail";
String pat = "(^start|(?!\\A)\\G)\\s+(m[aei]ddel[0-9])(?=(?:\\s+m[aei]ddel[0-9])*\\s+tail$)";
System.out.println(s.replaceAll(pat, "$1 <middle>" ));
See the Java online demo
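As a quick check of the validation side of the replaceAll call above, the following sketch wraps the same pattern in a hypothetical helper (class and method names are assumptions) and runs it on one well-formed and two malformed inputs. Inputs that fail the lookahead or lack the leading start produce no match, so they come back unchanged.

```java
// Hypothetical wrapper around the pattern and replacement from this answer.
class LookaheadValidationDemo {
    private static final String PAT =
            "(^start|(?!\\A)\\G)\\s+(m[aei]ddel[0-9])(?=(?:\\s+m[aei]ddel[0-9])*\\s+tail$)";

    static String replaceMiddles(String s) {
        // $1 is "start" for the first match and empty for the following ones
        return s.replaceAll(PAT, "$1 <middle>");
    }

    public static void main(String[] args) {
        // prints "start <middle> <middle> <middle> tail"
        System.out.println(replaceMiddles("start maddel1 meddel2 middel3 tail"));
        // a foreign token breaks the lookahead: printed unchanged
        System.out.println(replaceMiddles("start maddel1 oops tail"));
        // missing "start": printed unchanged
        System.out.println(replaceMiddles("maddel1 meddel2 tail"));
    }
}
```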
Explanation:
- (^start|(?!\A)\G) - match start at the beginning of the string, or the end of the previous successful match
- \s+ - 1 or more whitespaces
- m[aei]ddel[0-9] - m, then either a, e, or i, then ddel, then 1 digit
- (?=(?:\s+m[aei]ddel[0-9])*\s+tail$) - only if followed with:
  - (?:\s+m[aei]ddel[0-9])* - zero or more sequences of 1+ whitespaces and the middelN pattern
  - \s+ - 1 or more whitespaces
  - tail$ - the tail substring followed with the end of string.
Upvotes: 2
Reputation: 785316
You can use \G for this:
final String regex = "(^start |(?<!^)\\G)m[aei]ddel[0-9] (?=(?:m[aei]ddel[0-9] )*tail$)";
final String str = "start maddel1 meddel2 middel3 tail";
String repl = str.replaceAll(regex, "$1<match> ");
//=> start <match> <match> <match> tail
\G asserts position at the end of the previous match or the start of the string for the first match.
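To see that behaviour of \G in isolation, here is a small sketch that is not taken from the answer (the class name, method name and sample input are assumptions): each match must begin exactly where the previous one ended, so the chain of replacements stops at the first character that does not fit.

```java
// Hypothetical, stripped-down illustration of the \G boundary matcher.
class GAnchorBasicsDemo {
    static String maskLeadingDigits(String s) {
        // \G ties every match to the end of the previous one (or to the start
        // of the input for the first match), so digits after a gap are not touched.
        return s.replaceAll("\\G\\d", "#");
    }

    public static void main(String[] args) {
        // prints "###a45": the 'a' breaks the chain, so "45" is left alone
        System.out.println(maskLeadingDigits("123a45"));
    }
}
```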
Upvotes: 3