driima

Reputation: 643

Match contents within square brackets, including nested square brackets

I am attempting to write a spoiler identification system so that any spoilers in a string are replaced with a specified spoiler character.

I want to match a string surrounded by square brackets, such that the contents within the square brackets are capture group 1 and the whole string, including the surrounding brackets, is the match.

I am currently using \[(.*?]*)\], a slight modification of an expression found in another answer, because I also want nested square brackets to be part of capture group 1.

The problem with that expression is that, although it works and matches the following:

However, if I want to match the following, it does not work as expected:

What expression should I use such that it matches [sandwich with [pickles] and [onions]] with sandwich with [pickles] and [onions] as group 1?
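To make the failure concrete, here is a minimal sketch that runs the expression from the question against the nested input. The class name LazyBracketRegexDemo and the helper describeMatches are hypothetical names introduced only for this illustration; the pattern itself is the one quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class LazyBracketRegexDemo {

    // The expression from the question, written as a Java string literal.
    static final Pattern ATTEMPT = Pattern.compile("\\[(.*?]*)\\]");

    // Returns "whole match -> group 1" for every match found in the input.
    static List<String> describeMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = ATTEMPT.matcher(input);
        while (m.find()) {
            out.add(m.group() + " -> " + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // The lazy quantifier stops at the first closing bracket it can use,
        // so the nested input is split into two partial matches.
        for (String line : describeMatches("[sandwich with [pickles] and [onions]]")) {
            System.out.println(line);
        }
    }
}
```

On the nested input this should report two partial matches, [sandwich with [pickles] and [onions]], instead of one match covering the whole bracketed string. The pattern has no way to count how many brackets are currently open.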

EDIT:

As it seems impossible to achieve this in Java using regex, is there an alternative solution?

EDIT 2:

I also want to be able to split the string at each match found. String.split(regex) makes that convenient, so an alternative to regular expressions would be harder to implement. Here's an example:

And the split sentence should look like:

Jim ate a
with
and

Upvotes: 2

Views: 1115

Answers (1)

Wiktor Stribiżew

Reputation: 626699

More direct solution

This solution omits empty and whitespace-only substrings.

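import java.util.ArrayList;
import java.util.List;

// Returns the trimmed, non-empty pieces of s that lie outside
// top-level balanced markStart...markEnd pairs.
public static List<String> getStrsBetweenBalancedSubstrings(String s, Character markStart, Character markEnd) {
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;             // current nesting depth
    int lastCloseBracket = 0;  // index just past the most recent matched closing marker
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            // A top-level opening marker ends the current in-between piece.
            if (level == 1 && i != 0 && i != lastCloseBracket &&
                !s.substring(lastCloseBracket, i).trim().isEmpty()) {
                subTreeList.add(s.substring(lastCloseBracket, i).trim());
            }
        } else if (c == markEnd) {
            // An unmatched closing marker (level == 0) is treated as plain text.
            if (level > 0) {
                level--;
                lastCloseBracket = i + 1;
            }
        }
    }
    // Add whatever follows the last matched closing marker.
    if (lastCloseBracket != s.length() && !s.substring(lastCloseBracket).trim().isEmpty()) {
        subTreeList.add(s.substring(lastCloseBracket).trim());
    }
    return subTreeList;
}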

Then, use it as

String input = "Jim ate a [sandwich][ooh] with [pickles] and [dried [onions]] and ] [an[other] match] and more here";
List<String> between_balanced =  getStrsBetweenBalancedSubstrings(input, '[', ']');
System.out.println("Result: " + between_balanced);
// => Result: [Jim ate a, with, and, and ], and more here]

Original answer (more complex, shows a way to extract nested brackets)

You can also extract all substrings inside balanced brackets and then split with them:

// Requires: import java.util.*; import java.util.regex.Pattern;
String input = "Jim ate a [sandwich] with [pickles] and [dried [onions]] and ] [an[other] match]";
List<String> balanced = getBalancedSubstrings(input, '[', ']', true);
System.out.println("Balanced ones: " + balanced);
// Build one alternation of the quoted balanced substrings,
// each with optional surrounding whitespace.
List<String> rx_split = new ArrayList<String>();
for (String item : balanced) {
    rx_split.add("\\s*" + Pattern.quote(item) + "\\s*");
}
String rx = String.join("|", rx_split);
System.out.println("In-betweens: " + Arrays.toString(input.split(rx)));

And this function will find all []-balanced substrings:

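// Collects every top-level balanced markStart...markEnd substring of s.
// includeMarkers controls whether the outer markers are kept in each result.
public static List<String> getBalancedSubstrings(String s, Character markStart, 
                                     Character markEnd, Boolean includeMarkers) {
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;             // current nesting depth
    int lastOpenBracket = -1;  // start index of the current top-level substring
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                lastOpenBracket = (includeMarkers ? i : i + 1);
            }
        }
        else if (c == markEnd) {
            // Closing the outermost marker completes one balanced substring.
            if (level == 1) {
                subTreeList.add(s.substring(lastOpenBracket, (includeMarkers ? i + 1 : i)));
            }
            // An unmatched closing marker (level == 0) is ignored.
            if (level > 0) level--;
        }
    }
    return subTreeList;
}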

See IDEONE demo

Result of the code execution:

Balanced ones: [[sandwich], [pickles], [dried [onions]], [an[other] match]]
In-betweens: [Jim ate a, with, and, and ]]

Credits: getBalancedSubstrings is based on peter.murray.rust's answer to the post How to split this “Tree-like” string in Java regex?
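For the spoiler replacement that the question starts from, the same depth-counting idea can be applied directly, without splitting. The following is only a sketch under assumptions: the class SpoilerMaskSketch, the method maskSpoilers and the policy of masking the outer brackets together with their contents are all hypothetical choices, not something stated in the question or the answer.

```java
// Hypothetical helper; name, signature and masking policy are assumptions.
class SpoilerMaskSketch {

    // Replaces every character of each top-level balanced [...] region
    // (outer brackets included) with the mask character.
    // Unmatched ']' and unclosed '[' are left untouched.
    static String maskSpoilers(String s, char mask) {
        StringBuilder out = new StringBuilder(s);
        int level = 0;   // current nesting depth
        int open = -1;   // index of the current top-level '['
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '[') {
                if (level == 0) {
                    open = i;
                }
                level++;
            } else if (c == ']' && level > 0) {
                level--;
                if (level == 0) {
                    // The outermost bracket just closed: mask the whole region.
                    for (int j = open; j <= i; j++) {
                        out.setCharAt(j, mask);
                    }
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskSpoilers("Jim ate a [sandwich with [pickles] and [onions]]", '*'));
    }
}
```

Masking in place keeps the string length unchanged. If only the contents should be hidden, the inner loop could run from open + 1 to i - 1 instead.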

Upvotes: 2
