Reputation: 8933

Regular expression to transform brackets and nested brackets when inside a markup

I want to write a regex that removes the brackets surrounding [cent]:

String input1 = "this is a [cent] and [cent] string";
String output1 = "this is a cent and cent string";

But if it is nested like:

String input2="this is a [cent[cent] and [cent]cent] string"
String output2="this is a cent[cent and cent]cent string"

I can only use replaceAll on the string, so how do I create the pattern in the code below, and what should the replacement string be?

Pattern rulerPattern1 = Pattern.compile("", Pattern.MULTILINE);
System.out.println(rulerPattern1.matcher(input1).replaceAll(""));

Update: nested brackets are well-formed and can be only two levels deep, like in case 2.

Edit: If the string is "[<centd>[</centd>]purposes[<centd>]</centd>]", then the output should be <centd>[</centd> purposes <centd>]</centd>. Basically, if a bracket sits between a centd opening and closing tag, leave it there; otherwise remove it.

Upvotes: 1

Answers (5)

Ro Yo Mi

Reputation: 15010

Description

This regex removes a bracket when it has whitespace on exactly one side, i.e. a whitespace character on one side and a non-whitespace character on the other.

regex: (?<=\s)[\[\]](?=\S)|(?<=\S)[\[\]](?=\s)

replace with empty string

Summary

Sample 1
- Input: this is a [cent] and [cent] string
- Output: this is a cent and cent string
Sample 2
- Input: this is a [cent[cent] and [cent]cent] string
- Output: this is a cent[cent and cent]cent string
Sample 3
- Input: [<cent>[</cent>] and [<cent>]Chemotherapy services.</cent>]
- Output: [<cent>[</cent> and <cent>]Chemotherapy services.</cent>]
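
For anyone who wants to try the whitespace-based expression above in Java, here is a minimal sketch. The class name SpaceSideBrackets and the method name strip are made up for illustration; only the regex comes from this answer.

```java
import java.util.regex.Pattern;

class SpaceSideBrackets {
    // Matches a bracket that has whitespace on one side and a
    // non-whitespace character on the other side.
    static final Pattern ONE_SIDED = Pattern.compile(
            "(?<=\\s)[\\[\\]](?=\\S)|(?<=\\S)[\\[\\]](?=\\s)");

    // Hypothetical helper: removes every bracket matched by ONE_SIDED.
    static String strip(String input) {
        return ONE_SIDED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: this is a cent and cent string
        System.out.println(strip("this is a [cent] and [cent] string"));
        // prints: this is a cent[cent and cent]cent string
        System.out.println(strip("this is a [cent[cent] and [cent]cent] string"));
    }
}
```

Note that a bracket at the very start or end of the string has no character on one side, so neither lookbehind/lookahead pair matches and the bracket is left in place, as Sample 3 shows.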

To address the edit on the question, this expression finds:

- [<centd>[</centd>] and replaces it with <centd>[</centd>
- [<centd>] or [</centd>], and removes just the outer square brackets
- all other square brackets are retained

regex: \[(<centd>[\[\]]<\/centd>)\]|\[(<\/?centd>)\]

replace with: $1$2

Sample 4
- Input: [<centd>[</centd>]purposes[<centd>]</centd>]
- Output: <centd>[</centd>purposes<centd>]</centd>
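
A minimal Java sketch of the centd expression, assuming it is applied through Matcher.replaceAll with the $1$2 replacement. The class name CentdBrackets and the method name unwrap are hypothetical, and the unnecessary \/ escape is written as a plain / here.

```java
import java.util.regex.Pattern;

class CentdBrackets {
    // Group 1: a centd element holding a single bracket, wrapped in brackets.
    // Group 2: a lone centd opening or closing tag, wrapped in brackets.
    static final Pattern CENTD = Pattern.compile(
            "\\[(<centd>[\\[\\]]</centd>)\\]|\\[(</?centd>)\\]");

    // Hypothetical helper: drops the outer brackets of each match.
    static String unwrap(String input) {
        // Only one group takes part in any given match; Java inserts
        // nothing for the group that did not participate.
        return CENTD.matcher(input).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        // prints: <centd>[</centd>purposes<centd>]</centd>
        System.out.println(unwrap("[<centd>[</centd>]purposes[<centd>]</centd>]"));
    }
}
```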

Upvotes: 6

grepit

Reputation: 22392

You can use Java's Matcher to transform the brackets. I wrote this one for you below:

         String input = "this is a [cent[cent] and [cent]cent] string";
         Pattern p = Pattern.compile("\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]");
         Matcher m = p.matcher(input);
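
The snippet above compiles the pattern and creates the matcher but stops there. As a rough sketch of one way to continue (not necessarily what this answer intended), the code below inspects what group 1 captures. The class name InspectMatch and its method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InspectMatch {
    static final Pattern BRACKETED = Pattern.compile(
            "\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]");

    // Returns the text inside the first outermost bracket pair, or null.
    static String firstInner(String input) {
        Matcher m = BRACKETED.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Replaces each outermost bracketed span by its inner text.
    static String dropOuter(String input) {
        return BRACKETED.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "this is a [cent[cent] and [cent]cent] string";
        // prints: cent[cent] and [cent]cent
        System.out.println(firstInner(input));
        // prints: this is a cent[cent] and [cent]cent string
        System.out.println(dropOuter(input));
    }
}
```

Replacing with $1 alone only strips the outermost pair, which is not the output the question asks for; the inner brackets still need the extra processing shown in the other answers.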

Upvotes: -1

nhahtdh

Reputation: 56829

Assumptions

From the question, the assumption is that there are no more than 2 levels of nesting brackets. It is also assumed that the brackets are balanced.

I further assume that you don't allow escaping of [].

I also assume that when there are nested brackets, only the first opening [ and the last closing ] brackets of the inner brackets are preserved. The rest, i.e. the top level brackets and the rest of the inner brackets are removed.

For example:

only[single] [level] outside[text more [text] some [text]moreeven[more]text[bracketed]] still outside

After replacement will become:

onlysingle level outsidetext more [text some textmoreevenmoretextbracketed] still outside

I make no assumptions beyond those above.

If you can make assumptions about the spacing before and after brackets, then you can use the simpler solution by Denomales. Otherwise, my solution below works without such an assumption.

Solution

private static String replaceBracket(String input) {
    // Search for singly and doubly bracketed text
    Pattern p = Pattern.compile("\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]");
    Matcher matcher = p.matcher(input);

    StringBuffer output = new StringBuffer(input.length());

    while (matcher.find()) {
        // Take the text inside the outer most bracket
        String innerText = matcher.group(1);
        int startIndex = innerText.indexOf("[");
        int endIndex;

        String replacement;

        if (startIndex != -1) {
            // 2 levels of nesting
            endIndex = innerText.lastIndexOf("]");

            // Remove all [] except for first [ and last ]
            replacement = 
                // Text before and including first [
                innerText.substring(0, startIndex + 1) + 
                // Text inbetween, stripped of all the brackets []
                innerText.substring(startIndex + 1, endIndex).replaceAll("[\\[\\]]", "") +
                // Text after and including last ]
                innerText.substring(endIndex);
        } else {
            // No nesting
            replacement = innerText;
        }

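        // Quote the replacement so that any $ or \ in the text is inserted literally
        matcher.appendReplacement(output, Matcher.quoteReplacement(replacement));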
    }

    matcher.appendTail(output);

    return output.toString();
}
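
For comparison, here is a more compact sketch of the same idea, assuming Java 9 or later, where Matcher.replaceAll accepts a function of the match. The class name BracketStripper and the method name strip are hypothetical; the regex and the keep-first-[ / keep-last-] rule are the ones from the solution above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketStripper {
    static final Pattern BRACKETED = Pattern.compile(
            "\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]");

    static String strip(String input) {
        return BRACKETED.matcher(input).replaceAll(match -> {
            String inner = match.group(1);
            int first = inner.indexOf('[');
            if (first == -1) {
                // No nesting: keep the inner text as is.
                return Matcher.quoteReplacement(inner);
            }
            int last = inner.lastIndexOf(']');
            // Strip every bracket between the first [ and the last ].
            String middle = inner.substring(first + 1, last)
                                 .replaceAll("[\\[\\]]", "");
            // quoteReplacement keeps any $ or \ in the text literal.
            return Matcher.quoteReplacement(
                    inner.substring(0, first + 1) + middle + inner.substring(last));
        });
    }

    public static void main(String[] args) {
        // prints: this is a cent and cent string
        System.out.println(strip("this is a [cent] and [cent] string"));
        // prints: this is a cent[cent and cent]cent string
        System.out.println(strip("this is a [cent[cent] and [cent]cent] string"));
    }
}
```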

Explanation

The only thing worth explaining here is the regex. For the rest, check the documentation of the Matcher class.

"\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]"

In RAW form (when you print out the string):

\[((?:[^\[\]]++|\[[^\[\]]*+\])*+)\]

Let us break it up (spaces are insignificant):

\[                    # Outermost opening bracket
(                     # Capturing group 1
  (?:
    [^\[\]]++         # Text that doesn't contain []
    |                 # OR
    \[[^\[\]]*+\]     # A nested bracket containing text without []
  )*+
)                     # End of capturing group 1
\]                    # Outermost closing bracket
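
The broken-up form can also be compiled directly in Java with the Pattern.COMMENTS flag, which ignores whitespace and # comments inside the pattern. The sketch below is only an illustration of that flag; the class name CommentedPattern and the method name firstInner are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedPattern {
    static final Pattern COMMENTED = Pattern.compile(
              "\\[                      # outermost opening bracket\n"
            + "(                        # capturing group 1\n"
            + "  (?:\n"
            + "    [^\\[\\]]++          # text without brackets\n"
            + "    |                    # or\n"
            + "    \\[[^\\[\\]]*+\\]    # one nested bracket pair\n"
            + "  )*+\n"
            + ")                        # end of capturing group 1\n"
            + "\\]                      # outermost closing bracket\n",
            Pattern.COMMENTS);

    // Returns the text inside the first outermost bracket pair, or null.
    static String firstInner(String input) {
        Matcher m = COMMENTED.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints: cent[cent] and [cent]cent
        System.out.println(firstInner("this is a [cent[cent] and [cent]cent] string"));
    }
}
```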

I used possessive quantifiers *+ and ++ in order to prevent backtracking by the regex engine. The version with normal greedy quantifier \[((?:[^\[\]]+|\[[^\[\]]*\])*)\] would still work, but will be slightly inefficient and can cause a StackOverflowError on big enough input.
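
As a small check of the claim that the greedy version still works, the sketch below collects group 1 from both patterns on a short input. The class name GreedyVsPossessive and the method name captures are hypothetical, and the example says nothing about performance on large inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsPossessive {
    static final Pattern POSSESSIVE = Pattern.compile(
            "\\[((?:[^\\[\\]]++|\\[[^\\[\\]]*+\\])*+)\\]");
    static final Pattern GREEDY = Pattern.compile(
            "\\[((?:[^\\[\\]]+|\\[[^\\[\\]]*\\])*)\\]");

    // Collects the group 1 text of every match of the given pattern.
    static List<String> captures(Pattern p, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "only[single] [level] outside[text more [text] some]";
        // Both lines print: [single, level, text more [text] some]
        System.out.println(captures(POSSESSIVE, input));
        System.out.println(captures(GREEDY, input));
    }
}
```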

Upvotes: 0

Mena

Reputation: 48444

If it's really only about finding brackets surrounding "cent", you could use the following approach (with lookbehind, lookahead):

Edited to leave some of the brackets in place, as per the expected output: this is now a combination of positive and negative lookbehinds and lookaheads. In other words, regex is unlikely to be the right tool here, but this does work with the literals provided and then some.

// surrounding
String test1 = "this is a [cent] and [cent] string";
// pseudo-nested
String test2 = "this is a [cent[cent] and [cent]cent] string";
// nested
String test3 = "this is a [cent[cent]] and [cent]cent]] string";
Pattern pattern = Pattern.compile("((?<!cent)\\[+(?=cent))|((?<=cent)\\]+(?!cent))");
Matcher matcher = pattern.matcher(test1);
if (matcher.find()) {
    System.out.println(matcher.replaceAll(""));
}
matcher = pattern.matcher(test2);
if (matcher.find()) {
    System.out.println(matcher.replaceAll(""));
}
matcher = pattern.matcher(test3);
if (matcher.find()) {
    System.out.println(matcher.replaceAll(""));
}

Output:

this is a cent and cent string
this is a cent[cent and cent]cent string
this is a cent[cent and cent]cent string

Upvotes: 0

9000

Reputation: 40904

Regular expressions are unfit for this purpose in the general case. Nested structures form a recursive grammar, not a regular grammar. (That's why you don't parse HTML with regular expressions, BTW.)

If you only have a limited depth of bracket nesting, you can write a regular expression for that. But you need to state your nesting depth first, and the regexp will not be all that pretty.
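
To illustrate the point about arbitrary depth, here is a sketch of a plain scanner with a depth counter instead of a regex. It strips only the outermost brackets at any nesting depth, which is a simpler rule than the exact transformation asked for in the question; the class name DepthStripper and the method name stripOutermost are made up for this example.

```java
class DepthStripper {
    // Drops every bracket at nesting depth 1 and keeps deeper ones.
    // A stray ] outside any bracket is dropped as well.
    static String stripOutermost(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int depth = 0;
        for (char c : input.toCharArray()) {
            if (c == '[') {
                depth++;
                if (depth > 1) {
                    out.append(c);
                }
            } else if (c == ']') {
                if (depth > 1) {
                    out.append(c);
                }
                if (depth > 0) {
                    depth--;
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: a b [c [d]] e f
        System.out.println(stripOutermost("a [b [c [d]] e] f"));
        // prints: this is a cent[cent] and [cent]cent string
        System.out.println(stripOutermost("this is a [cent[cent] and [cent]cent] string"));
    }
}
```

The counter is what a regular expression lacks: it tracks how deep the scan currently is, so the same loop handles two levels or twenty.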

Upvotes: 0