Reputation: 643
I am attempting to write a spoiler identification system so that any spoilers in a string are replaced with a specified spoiler character.
I want to match a string surrounded by square brackets, such that the contents within the square brackets are capture group 1 and the whole string, including the surrounding brackets, is the match.
I am currently using \[(.*?]*)\], a slight modification of the expression found in this answer here, as I also want nested square brackets to be part of capture group 1.
The problem with that expression is that, although it works for the following inputs:
Jim ate a [sandwich]
    matches: [sandwich]
    group 1: sandwich
Jim ate a [sandwich with [pickles and onions]]
    matches: [sandwich with [pickles and onions]]
    group 1: sandwich with [pickles and onions]
[[[[]
    matches: [[[[]
    group 1: [[[
[]]]]
    matches: []]]]
    group 1: ]]]
it does not work as expected for this one:
Jim ate a [sandwich with [pickles] and [onions]]
    match 1: [sandwich with [pickles]
    group 1: sandwich with [pickles
    match 2: [onions]]
    group 1: onions]
What expression should I use so that the match is [sandwich with [pickles] and [onions]] and group 1 is sandwich with [pickles] and [onions]?
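For anyone who wants to reproduce the behaviour described above, here is a minimal sketch. The class and method names (LazyBracketRepro, describeMatches) are illustrative only; the pattern is the one quoted in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyBracketRepro {
    // The pattern from the question: lazy body, optional run of ']', then a closing ']'.
    static final Pattern CURRENT = Pattern.compile("\\[(.*?]*)\\]");

    // Returns each match as "whole match -> group 1" so the two can be compared.
    static List<String> describeMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = CURRENT.matcher(input);
        while (m.find()) {
            out.add(m.group() + " -> " + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // One match, as wanted
        System.out.println(describeMatches("Jim ate a [sandwich with [pickles and onions]]"));
        // Two partial matches instead of one: the lazy body stops at the first ']'
        System.out.println(describeMatches("Jim ate a [sandwich with [pickles] and [onions]]"));
    }
}
```

The second call shows the issue: the lazy quantifier ends the match at the first closing bracket it can reach, so sibling nested groups get split into separate matches.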
EDIT:
As it seems impossible to achieve this in Java using regex, is there an alternative solution?
EDIT 2:
I also want to be able to split the string at each match found, so an alternative to regular expressions would be harder to implement, given how convenient String.split(regex) is. Here's an example:
Jim ate a [sandwich] with [pickles] and [dried [onions]]
    match 1: [sandwich]
    group 1: sandwich
    match 2: [pickles]
    group 1: pickles
    match 3: [dried [onions]]
    group 1: dried [onions]
And the split sentence should look like:
    Jim ate a
    with
    and
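Since java.util.regex offers neither recursion nor balancing groups, one regex-only workaround that still works with String.split is a pattern that handles nesting up to a fixed depth. The sketch below assumes the nesting never exceeds a known limit; the helper name bracketRegex and its maxNesting parameter are hypothetical, not part of any library. Input nested deeper than the limit would only match at an inner level.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedDepthBrackets {
    // Builds a pattern for a bracketed group whose contents nest at most maxNesting levels deep.
    static String bracketRegex(int maxNesting) {
        String inner = "[^\\[\\]]*";  // level 0: no brackets inside
        for (int i = 0; i < maxNesting; i++) {
            // each pass allows one more level of [ ... ] inside the contents
            inner = "(?:[^\\[\\]]|\\[" + inner + "\\])*";
        }
        return "\\[(" + inner + ")\\]";  // group 1 = contents without the outer brackets
    }

    public static void main(String[] args) {
        String rx = bracketRegex(2);
        Matcher m = Pattern.compile(rx).matcher("Jim ate a [sandwich with [pickles] and [onions]]");
        if (m.find()) {
            System.out.println(m.group() + " / " + m.group(1));
        }
        String sentence = "Jim ate a [sandwich] with [pickles] and [dried [onions]]";
        // Surrounding whitespace is swallowed so the pieces come out trimmed
        System.out.println(Arrays.toString(sentence.split("\\s*" + rx + "\\s*")));
    }
}
```

The inner groups are non-capturing, so group 1 always refers to the contents of the outermost brackets of each match.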
Upvotes: 2
Views: 1115
Reputation: 626699
This solution omits empty or whitespace-only substrings:
    // Requires: import java.util.ArrayList; import java.util.List;
    public static List<String> getStrsBetweenBalancedSubstrings(String s, Character markStart, Character markEnd) {
        List<String> subTreeList = new ArrayList<String>();
        int level = 0;             // current bracket nesting depth
        int lastCloseBracket = 0;  // index just past the most recent matched closing marker
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == markStart) {
                level++;
                // A top-level opening marker ends an in-between chunk: keep it if non-blank
                if (level == 1 && i != 0 && i != lastCloseBracket &&
                        !s.substring(lastCloseBracket, i).trim().isEmpty()) {
                    subTreeList.add(s.substring(lastCloseBracket, i).trim());
                }
            } else if (c == markEnd) {
                if (level > 0) {  // ignore closing markers that have no matching opener
                    level--;
                    lastCloseBracket = i + 1;
                }
            }
        }
        // Add whatever follows the last closing marker
        if (lastCloseBracket != s.length() && !s.substring(lastCloseBracket).trim().isEmpty()) {
            subTreeList.add(s.substring(lastCloseBracket).trim());
        }
        return subTreeList;
    }
Then use it like this:
    String input = "Jim ate a [sandwich][ooh] with [pickles] and [dried [onions]] and ] [an[other] match] and more here";
    List<String> between_balanced = getStrsBetweenBalancedSubstrings(input, '[', ']');
    System.out.println("Result: " + between_balanced);
    // => Result: [Jim ate a, with, and, and ], and more here]
You can also extract all balanced bracketed substrings and then split on them:
    // Requires: java.util.ArrayList, java.util.Arrays, java.util.List, java.util.regex.Pattern
    String input = "Jim ate a [sandwich] with [pickles] and [dried [onions]] and ] [an[other] match]";
    List<String> balanced = getBalancedSubstrings(input, '[', ']', true);
    System.out.println("Balanced ones: " + balanced);
    List<String> rx_split = new ArrayList<String>();
    for (String item : balanced) {
        // Quote each item so its brackets are treated literally; absorb surrounding whitespace
        rx_split.add("\\s*" + Pattern.quote(item) + "\\s*");
    }
    String rx = String.join("|", rx_split);
    System.out.println("In-betweens: " + Arrays.toString(input.split(rx)));
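The Pattern.quote call matters here because each extracted item contains square brackets, which a regex would otherwise read as a character class. A small sketch of the difference (QuoteDemo and splitOn are illustrative names, not part of the code above):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteDemo {
    // Splits input on a delimiter, optionally quoting it so regex metacharacters lose their meaning.
    static String[] splitOn(String input, String delimiter, boolean quote) {
        String rx = "\\s*" + (quote ? Pattern.quote(delimiter) : delimiter) + "\\s*";
        return input.split(rx);
    }

    public static void main(String[] args) {
        // Quoted: "[b]" is matched literally, brackets included
        System.out.println(Arrays.toString(splitOn("a [b] c", "[b]", true)));
        // Unquoted: "[b]" is a character class that matches only the letter b
        System.out.println(Arrays.toString(splitOn("a [b] c", "[b]", false)));
    }
}
```

With quoting, the pieces are "a" and "c"; without it, the brackets stay behind in the pieces ("a [" and "] c").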
And this function will find all []-balanced substrings:
    // Requires: import java.util.ArrayList; import java.util.List;
    public static List<String> getBalancedSubstrings(String s, Character markStart,
                                                     Character markEnd, Boolean includeMarkers) {
        List<String> subTreeList = new ArrayList<String>();
        int level = 0;             // current bracket nesting depth
        int lastOpenBracket = -1;  // start index of the current top-level group
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == markStart) {
                level++;
                if (level == 1) {
                    lastOpenBracket = (includeMarkers ? i : i + 1);
                }
            } else if (c == markEnd) {
                if (level == 1) {  // this marker closes a top-level group
                    subTreeList.add(s.substring(lastOpenBracket, (includeMarkers ? i + 1 : i)));
                }
                if (level > 0) level--;  // unmatched closing markers are ignored
            }
        }
        return subTreeList;
    }
See IDEONE demo
Result of the code execution:
    Balanced ones: [[sandwich], [pickles], [dried [onions]], [an[other] match]]
    In-betweens: [Jim ate a, with, and, and ]]
Credits: getBalancedSubstrings is based on peter.murray.rust's answer to the post "How to split this “Tree-like” string in Java regex?".
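If both the matches and the in-between pieces are needed, the two scans can also be combined into one pass, which avoids building a regex from the extracted items. This is only a sketch of that idea, not the code from the demo: the class BalancedSplitSketch and its fields are hypothetical names, and it assumes a group that is never closed should be treated as plain text.

```java
import java.util.ArrayList;
import java.util.List;

class BalancedSplitSketch {
    final List<String> matches = new ArrayList<>();  // whole matches, outer brackets included
    final List<String> groups = new ArrayList<>();   // "group 1": contents without outer brackets
    final List<String> between = new ArrayList<>();  // trimmed, non-empty text outside matches

    static BalancedSplitSketch scan(String s, char open, char close) {
        BalancedSplitSketch r = new BalancedSplitSketch();
        int level = 0;      // current nesting depth
        int start = -1;     // index of the marker that opened the current top-level group
        int textStart = 0;  // index just past the last completed top-level group
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == open) {
                if (level == 0) start = i;
                level++;
            } else if (c == close && level > 0) {  // stray closers stay in the text
                level--;
                if (level == 0) {
                    r.addBetween(s.substring(textStart, start));
                    r.matches.add(s.substring(start, i + 1));
                    r.groups.add(s.substring(start + 1, i));
                    textStart = i + 1;
                }
            }
        }
        r.addBetween(s.substring(textStart));  // text after the last completed group
        return r;
    }

    private void addBetween(String piece) {
        String t = piece.trim();
        if (!t.isEmpty()) between.add(t);
    }

    public static void main(String[] args) {
        BalancedSplitSketch r = scan("Jim ate a [sandwich] with [pickles] and [dried [onions]]", '[', ']');
        System.out.println(r.matches);
        System.out.println(r.groups);
        System.out.println(r.between);
    }
}
```

Recording a group only when the depth returns to zero is what keeps sibling nested groups such as [pickles] and [onions] inside a single outer match.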
Upvotes: 2