Reputation: 2568
I'm struggling to find the correct way to split a string on multi-character delimiters in Java (e.g. '. [1a]' or '.(2b)').
Here's a test case:
String str1 = "This is test 1 . This is test 2 [2 b]. This is test 3 (3). This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";
Pattern regex = Pattern.compile("\\.\\s{0,}\\[.*\\]\\s{0,}|\\.\\s{0,}\\(.*\\)\\s{0,}|\\.\\s{0}");
System.out.println(Arrays.toString(regex.split(str1)));
The output that I'm aiming for is the following (spaces in the beginning or end of each sub-string are fine, the important thing is to keep the delimiter):
[This is test 1 . , This is test 2 [2 b]. , This is test 3 (3). , This is test 4.[4a] , This is a test 5 . , This is test 6 . (6,six)]
However, this is the output I'm getting:
[This is test 1 , This is test 2 [2 b], This is test 3 (3), This is test 4, This is a test 5 , This is test 6 ]
I also tried dropping the "\\s", using a different notation for spaces such as Pattern.compile("\\s+\\[.?\\]\\s+\\.|\\s+\\(.?\\)\\s+\\.|\\.\\s+"),
and experimenting with lookarounds such as Pattern.compile("(?<=[.[*]\\s+])|(?=[.(*)]\\s+)|\\."),
but neither helped :|
Upvotes: 1
Views: 189
Reputation: 44496
This might be a bit tricky. Focus on what the wanted groups have in common: each group ends where the next one begins, and the next one always begins with a word character (\w) that starts a word of at least three characters and follows a space which is not itself preceded by a word character. Use that to detect a new group.
Replace each such match with a newline followed by the captured character, thus \n$1, and each group will appear on a new line, which is fairly easy to extract. The wanted Regex (see Regex101) is:
 (?<!\w )(\w)(?=\w{2,})
Note the (space) as the first character of the Regex! This would produce the following output:
This is test 1 .
This is test 2 [2 b].
This is test 3 (3).
This is test 4.[4a]
This is a test 5 .
This is test 6 . (6,six)
In Java, the code would use the methods replaceAll and split (thanks @jmng for the improvement):
import java.util.Arrays;
import java.util.regex.Pattern;

String str1 = "This is test 1 . This is test 2 [2 b]. This is test 3 (3). This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";
Pattern reg1 = Pattern.compile(" (?<!\\w )(\\w)(?=\\w{2,})"); // Preparation: matches the space plus first letter that start a new group
Pattern regNewline = Pattern.compile("\n"); // Split: one group per line
String[] array = regNewline.split(reg1.matcher(str1).replaceAll("\n$1")); // Apply: replace each match with newline + captured letter, then split
Arrays.stream(array).forEach(System.out::println); // Test it
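If the input could already contain newlines, the \n marker would also split at those places. A possible variation, sketched here under the assumption that the same boundary rule is wanted, is to split directly on the separating space and check the following letters with a lookahead instead of capturing them. The names DirectSplit, GROUP_BOUNDARY and splitGroups are made up for this illustration:

```java
import java.util.regex.Pattern;

class DirectSplit {
    // Same boundary rule as the pattern above: a space that is not preceded by a
    // word character and is followed by at least three word characters.
    // Only the space is consumed, so no intermediate "\n" marker is needed.
    private static final Pattern GROUP_BOUNDARY =
            Pattern.compile(" (?<!\\w )(?=\\w{3,})");

    // Hypothetical helper, not part of any library.
    static String[] splitGroups(String text) {
        return GROUP_BOUNDARY.split(text);
    }

    public static void main(String[] args) {
        String str1 = "This is test 1 . This is test 2 [2 b]. This is test 3 (3). "
                + "This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";
        for (String group : splitGroups(str1)) {
            System.out.println(group);
        }
    }
}
```

Like the original pattern, this relies on every new group starting with a word of three or more characters, so it is tied to the shape of the sample data.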
Upvotes: 2
Reputation: 163632
One possibility, if spaces at the beginning or end of each sub-string are acceptable and you want to use split, is an alternation inside a positive lookbehind that checks for each of your different delimiter forms.
In Java a lookbehind must have a bounded length, so the quantifiers inside it need an explicit minimum and maximum; for your example data you might, for example, take 1 and 10.
(?<=\[[^]]{1,10}]\.|\.\[[^]]{1,10}]|\([^)]{1,10}\)\.| \. (?!\([^)]+\)))
In Java:
(?<=\\[[^]]{1,10}]\\.|\\.\\[[^]]{1,10}]|\\([^)]{1,10}\\)\\.| \\. (?!\\([^)]+\\)))
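A minimal sketch of how this pattern might be used with split on the test string from the question. The class name LookbehindSplit and the helper splitKeepingDelimiters are hypothetical names chosen for this illustration, and the pattern string is the Java version shown above:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LookbehindSplit {
    // Zero-width split points: the lookbehind only asserts that a delimiter has
    // just ended, so split() consumes nothing and each delimiter stays attached
    // to the piece that precedes it.
    private static final Pattern SPLIT_POINT = Pattern.compile(
            "(?<=\\[[^]]{1,10}]\\.|\\.\\[[^]]{1,10}]|\\([^)]{1,10}\\)\\.| \\. (?!\\([^)]+\\)))");

    // Hypothetical helper, not part of any library.
    static String[] splitKeepingDelimiters(String text) {
        return SPLIT_POINT.split(text);
    }

    public static void main(String[] args) {
        String str1 = "This is test 1 . This is test 2 [2 b]. This is test 3 (3). "
                + "This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";
        // Pieces may carry a leading or trailing space.
        System.out.println(Arrays.toString(splitKeepingDelimiters(str1)));
    }
}
```

Because of the {1,10} bound, bracketed or parenthesised parts longer than 10 characters would not be recognised as delimiters; the bound would need to be raised for such data.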
Explanation
(?<=
Positive lookbehind to check what is on the left is
\[[^]]{1,10}]\.
Use a negated character class to match between square brackets, with a quantifier that repeats "not a closing bracket" 1 - 10 times, followed by a dot
|
Or
\.\[[^]]{1,10}]
Match a dot, then use a negated character class to match between square brackets, with a quantifier that repeats "not a closing bracket" 1 - 10 times
|
Or
\([^)]{1,10}\)\.
Use a negated character class to match between parentheses, with a quantifier that repeats "not a closing parenthesis" 1 - 10 times, followed by a dot
|
Or
 \. (?!\([^)]+\))
A space, a dot and a space (note the leading space in the pattern), if what follows is not anything between parentheses
)
Close positive lookbehind
Upvotes: 1