alecigne

Reputation: 81

How to match a string with a regex only if it's between two delimiters?

My goal is to delete all matches from an input using a regular expression with Java 7:

input.replaceAll([regex], "");

Given this example input with a target string abc-:

<TAG>test-test-abc-abc-test-abc-test-</TAG>test-abc-test-abc-<TAG>test-abc-test-abc-abc-</TAG>

What regex could I use in the code above to match abc- only when it is between the <TAG> and </TAG> delimiters? Here is the desired matching behaviour, with <--> for a match:

               <--><-->     <-->                                       <-->     <--><-->
<TAG>test-test-abc-abc-test-abc-test-</TAG>test-abc-test-abc-<TAG>test-abc-test-abc-abc-</TAG>

Expected result:

<TAG>test-test-test-test-</TAG>test-abc-test-abc-<TAG>test-test-</TAG>

The left and right delimiters are always different. I am not particularly looking for a recursive solution (nested delimiters).

I think this might be doable with lookaheads and/or lookbehinds but I didn't get anywhere with them.

Upvotes: 0

Views: 76

Answers (1)

Wiktor Stribiżew

Reputation: 626870

You can use a regex like

(?s)(\G(?!^)|<TAG>(?=.*?</TAG>))((?:(?!<TAG>|</TAG>).)*?)abc-

See the regex demo. Replace with $1$2, which puts back everything captured before abc- so that only abc- itself is dropped. Details:

  • (?s) - the embedded flag for Pattern.DOTALL, so that . also matches line break chars
  • (\G(?!^)|<TAG>(?=.*?</TAG>)) - Group 1 ($1): either of the two:
    • \G(?!^) - the end of the previous successful match; \G also matches at the start of the string, so (?!^) excludes that position
    • | - or
    • <TAG>(?=.*?</TAG>) - a <TAG> that is followed by any zero or more chars, as few as possible, and then </TAG> (this makes sure the closing, right-hand delimiter actually occurs further on in the string)
  • ((?:(?!<TAG>|</TAG>).)*?) - Group 2 ($2): any one char (.), repeated zero or more times but as few times as possible (*?), where no char may start a <TAG> or </TAG> char sequence (a construct known as a tempered greedy token)
  • abc- - the literal text to be removed.
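
To see how the two alternatives in Group 1 chain the matches together, here is a minimal sketch that lists what each match captures on the sample input from the question. The class name MatchTrace and the method trace are made up for illustration and are not part of the answer; only the pattern itself is taken from it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; the pattern is the one from the answer.
final class MatchTrace {
    private static final Pattern PATTERN = Pattern.compile(
            "(?s)(\\G(?!^)|<TAG>(?=.*?</TAG>))((?:(?!<TAG>|</TAG>).)*?)abc-");

    // Returns one entry per match, showing what Group 1 and Group 2 captured.
    static List<String> trace(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            out.add("g1='" + m.group(1) + "' g2='" + m.group(2) + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<TAG>test-test-abc-abc-test-abc-test-</TAG>"
                + "test-abc-test-abc-"
                + "<TAG>test-abc-test-abc-abc-</TAG>";
        for (String line : trace(text)) {
            System.out.println(line);
        }
    }
}
```

On the sample input this should report six matches: two that start at a <TAG> (Group 1 holds <TAG>) and four that continue from the previous match via \G (Group 1 is empty). The abc- occurrences between </TAG> and the next <TAG> are never reached, because \G only matches where the previous match ended and the tempered token cannot step over </TAG>.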

In Java:

String pattern = "(?s)(\\G(?!^)|<TAG>(?=.*?</TAG>))((?:(?!<TAG>|</TAG>).)*?)abc-";
String result = text.replaceAll(pattern, "$1$2");
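
For a quick check against the input from the question, the two lines can be wrapped in a small runnable program. This is only a sketch: the class name TagScopedRemoval and the method removeInsideTags are hypothetical names chosen here, and the code sticks to Java 7 syntax.

```java
// Hypothetical wrapper for illustration; only the pattern and the
// replacement string "$1$2" come from the answer.
final class TagScopedRemoval {
    private static final String PATTERN =
            "(?s)(\\G(?!^)|<TAG>(?=.*?</TAG>))((?:(?!<TAG>|</TAG>).)*?)abc-";

    // Removes every "abc-" that sits between <TAG> and </TAG>.
    static String removeInsideTags(String text) {
        // $1$2 restores the text captured before "abc-", so only "abc-" is dropped.
        return text.replaceAll(PATTERN, "$1$2");
    }

    public static void main(String[] args) {
        String input = "<TAG>test-test-abc-abc-test-abc-test-</TAG>"
                + "test-abc-test-abc-"
                + "<TAG>test-abc-test-abc-abc-</TAG>";
        // prints <TAG>test-test-test-test-</TAG>test-abc-test-abc-<TAG>test-test-</TAG>
        System.out.println(removeInsideTags(input));
    }
}
```

Input without any delimiters, or with a <TAG> that is never closed, should come back unchanged, since the lookahead (?=.*?</TAG>) prevents a match chain from starting there.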

Upvotes: 1
