TCCV

Reputation: 3182

Regex to replace nested tokens

I need to take a regex pattern and escape curly braces programmatically. The input regex will contain the following patterns (with text before, after, and between tags):

&{token1}
&{token1}&{token2}&{tokenN...}
&{token1&{token2&{tokenN...}}}

So far, I am fine with everything except the nested tags. This is what I have.

regex = regex.replaceAll("(&)(\\{)([^{}]+)(\\})", "$1\\\\$2$3\\\\$4");

I have also tried to use iteration and recursion but the problem that I'm running into is that once the innermost token is escaped, it messes with the match.

I have tried negative lookbehinds, but that doesn't do what I expect. It will only match/replace the innermost token.

regex = regex.replaceAll("(&)(\\{)([^(?<!\\\\{)|(?<!\\\\})]+)(\\})", "$1\\\\$2$3\\\\$4");

Any suggestions? Thanks in advance.

Edit: Example input/output

&{token1}   //input
&\{token1\} //output

&{token1}&{token2}&{tokenN...}        //input
&\{token1\}&\{token2\}&\{tokenN...\}  //output

&{token1&{token2&{tokenN...}}}        //input
&{token1&{token2&\{tokenN...\}}}      //output
&\{token1&\{token2&\{tokenN...\}\}\}  //expected output

//To throw a wrench into it, normal quantifiers should not be escaped
text{1,2}&{token1&{token2&{tokenN...}}}        //input
text{1,2}&{token1&{token2&\{tokenN...\}}}      //output
text{1,2}&\{token1&\{token2&\{tokenN...\}\}\}  //expected output

Edit 2: Example of what happens outside of this process: The tags will be resolved to text and then in the end, it should be a valid regex.

a{2}&{token1&{token2&{tokenN...}}}        //input
a{2}&\{token1&\{token2&\{tokenN...\}\}\}  //expected output of this regex
a{2}foobarbaz                             //expected output after tokens are resolved (&{token1} = foo, &{token2} = bar, &{tokenN...} = baz) 

Upvotes: 4

Views: 289

Answers (2)

Pshemo

Reputation: 124245

I would avoid regex and write a simple state machine that records, for each {, whether it was escaped. Each time we find a }, the most recent record tells us whether to escape it or leave it alone, and we can then discard that record since we don't need it anymore.

Your code could look something like this:

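import java.util.Stack;

public static String myEscape(String text){
    StringBuilder sb = new StringBuilder();

    char prev = '\0';
    // one entry per open '{': true if that brace was escaped
    Stack<Boolean> stack = new Stack<>();

    for (char ch : text.toCharArray()){
        if (ch == '{'){
            // only a '{' directly preceded by '&' opens a token
            if (prev == '&'){
                sb.append('\\');
            }
            stack.push(prev == '&');
        }else if (ch == '}'){
            // escape '}' only if its matching '{' was escaped;
            // an unmatched '}' is left as-is instead of throwing EmptyStackException
            if (!stack.isEmpty() && stack.pop()){
                sb.append('\\');
            }
        }
        sb.append(ch);
        prev = ch;
    }
    return sb.toString();
}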

Example:

text{1,2}&{token1&{token2{foo}...}}
  • we find the first { and, since it is not preceded by &, we push false onto the stack
  • we find } and, based on the top value of the stack (false), decide that it should not be escaped; that value is popped
  • we see another { and, since it is preceded by &, we push true onto the stack
  • we find another { and, since it is also preceded by &, we push another true
  • we find another {, which this time is not preceded by &, so we push false

So the stack records whether each upcoming } should be escaped. At this point it holds, from top to bottom, false -> true -> true, which means the remaining closing braces will come out as } \} \}.
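
To make the walk-through above easier to follow, here is a minimal tracing sketch of the same idea. The class name EscapeTrace and the method escapeWithTrace are made up for illustration, and it uses ArrayDeque in place of Stack so that the printed stack reads top first. It also assumes that an unmatched } should simply be left unescaped, which the original code does not address.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class EscapeTrace {

    // Hypothetical helper: same state machine as myEscape, plus a printout at every brace.
    static String escapeWithTrace(String text) {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> stack = new ArrayDeque<>(); // true = the matching '{' was escaped
        char prev = '\0';

        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '{') {
                boolean token = (prev == '&'); // only "&{" opens a token
                stack.push(token);
                if (token) {
                    out.append('\\');
                }
                System.out.println("{ at index " + i + ": push " + token
                        + ", stack (top first) = " + stack);
            } else if (ch == '}') {
                // Assumption: an unmatched '}' is left unescaped.
                boolean token = !stack.isEmpty() && stack.pop();
                if (token) {
                    out.append('\\');
                }
                System.out.println("} at index " + i + ": " + (token ? "escape" : "keep")
                        + ", stack (top first) = " + stack);
            }
            out.append(ch);
            prev = ch;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Last line printed: text{1,2}&\{token1&\{token2{foo}...\}\}
        System.out.println(escapeWithTrace("text{1,2}&{token1&{token2{foo}...}}"));
    }
}
```

Each printed line corresponds to one bullet of the walk-through, followed by the three pops that produce } \} \}.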

Upvotes: 1

m.cekiera

Reputation: 5395

Try with:

regex = regex.replaceAll("(?<=&)(?=\\{)|(?<!\\{\\d{0,6},?(\\d{0,6})?)(?=\\})","\\\\");

where {0,6} limits how many digits a quantifier bound may have (the lookbehind needs a bounded length in Java); 6 should be enough, I think. Java example:

public class Main {
    public static void main(String[] args){
        // three sample lines: flat tokens, nested tokens, and tokens mixed with quantifiers
        String regex = "&{token1}&{token2}&{tokenN}\n" +
                "&{token1&{token2&{tokenN}}}\n" +
                "text{1,2}&{token1{1}&{token2{1,}&{tokenN{0,2}}}}\n";
        regex = regex.replaceAll("(?<=&)(?=\\{)|(?<!\\{\\d{0,6},?(\\d{0,6})?)(?=\\})","\\\\");
        System.out.println(regex);
    }
}

with output:

&\{token1\}&\{token2\}&\{tokenN\}
&\{token1&\{token2&\{tokenN\}\}\}
text{1,2}&\{token1{1}&\{token2{1,}&\{tokenN{0,2}\}\}\}
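
For readers who find the one-line pattern hard to parse, the sketch below splits it into its two alternatives and applies it to the input from the question. The class name TokenBraceRegex, the constant names, and the method escapeTokens are hypothetical; the pattern itself is unchanged from the answer above.

```java
import java.util.regex.Pattern;

public class TokenBraceRegex {

    // Zero-width position between '&' and '{': a backslash inserted here
    // escapes the opening brace of a token.
    static final String BEFORE_TOKEN_OPEN = "(?<=&)(?=\\{)";

    // Zero-width position before '}', unless the preceding text looks like the
    // inside of a quantifier: '{', up to 6 digits, optional ',', up to 6 digits.
    static final String BEFORE_TOKEN_CLOSE = "(?<!\\{\\d{0,6},?(\\d{0,6})?)(?=\\})";

    static final Pattern INSERT_POINTS =
            Pattern.compile(BEFORE_TOKEN_OPEN + "|" + BEFORE_TOKEN_CLOSE);

    // Hypothetical helper: inserts a literal backslash at every matched position.
    static String escapeTokens(String regex) {
        return INSERT_POINTS.matcher(regex).replaceAll("\\\\");
    }

    public static void main(String[] args) {
        // prints: a{2}&\{token1&\{token2&\{tokenN...\}\}\}
        System.out.println(escapeTokens("a{2}&{token1&{token2&{tokenN...}}}"));
    }
}
```

One caveat follows from how the lookbehind works: it only inspects the characters before each }, so a token whose name consists solely of digits (for example &{12}) would look like a quantifier and its closing brace would stay unescaped.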

Upvotes: 1
