Split regex with multi char delimiters

Question

I'm struggling to find the correct way to split a string on multi-character delimiters in Java (e.g. '. [1a]' or '.(2b)').

Here's a test case:

String str1 = "This is test 1  .  This is test 2  [2 b]. This is test 3 (3). This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";

Pattern regex = Pattern.compile("\\.\\s{0,}\\[.*\\]\\s{0,}|\\.\\s{0,}\\(.*\\)\\s{0,}|\\.\\s{0}");

System.out.println(Arrays.toString(regex.split(str1)));

The output that I'm aiming for is the following (spaces in the beginning or end of each sub-string are fine, the important thing is to keep the delimiter):

[This is test 1 . , This is test 2 [2 b]. , This is test 3 (3). , This is test 4.[4a] , This is a test 5 . , This is test 6 . (6,six)]

However, this is the output I'm getting:

[This is test 1 , This is test 2 [2 b], This is test 3 (3), This is test 4, This is a test 5 , This is test 6 ]

I also tried dropping the "\s", a different notation for spaces such as Pattern.compile("\\s+\\[.?\\]\\s+\\.|\\s+\\(.?\\)\\s+\\.|\\.\\s+"), and experimented with lookarounds such as Pattern.compile("(?<=[.[*]\\s+])|(?=[.(*)]\\s+)|\\."), but none of these helped :|

Nikolas · Accepted Answer

This might be a bit tricky. Focus on what the groups have in common: each wanted group ends where the next one begins, and the next one always begins with a word character (\w), so use that to detect a new group.

Take advantage of this by replacing that character with itself preceded by a newline, i.e. \n$1, so that each group appears on its own line, which is fairly easy to extract. The wanted regex (see Regex101) is:

(?




Mind the single space character at the very start of the regex!


This would produce an output as:

This is test 1  . 
This is test 2  [2 b].
This is test 3 (3).
This is test 4.[4a]
This is a test 5 .
This is test 6 . (6,six)


In Java, the code would use the methods replaceAll and split (thanks @jmng for the improvement):

String str1 = "This is test 1  .  This is test 2  [2 b]. This is test 3 (3). This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";

Pattern reg1 = Pattern.compile(" (?
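
The pattern above is cut off, so what follows is only a minimal sketch of the replaceAll-then-split idea and not the answer's original regex. It assumes that a new group starts at a word character that comes after '.', ']' or ')' plus some whitespace. The class name SplitKeepingDelimiters, the method splitSentences and the BOUNDARY pattern are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

class SplitKeepingDelimiters {

    // Assumed pattern (NOT the answer's original regex): whitespace that
    // follows '.', ']' or ')' and is followed by a word character.
    // The word character is captured so it can be written back as $1.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=[.\\])])\\s+(\\w)");

    static String[] splitSentences(String text) {
        // Swap the boundary whitespace for a newline, keeping the captured character.
        String marked = BOUNDARY.matcher(text).replaceAll("\n$1");
        // Every group now sits on its own line, delimiter included.
        return marked.split("\n");
    }

    public static void main(String[] args) {
        String str1 = "This is test 1  .  This is test 2  [2 b]. This is test 3 (3). "
                + "This is test 4.[4a] This is a test 5 . This is test 6 . (6,six)";
        for (String part : splitSentences(str1)) {
            System.out.println(part);
        }
    }
}
```

Because the lookbehind is zero-width, the '.', ']' or ')' stays attached to the group it closes, and only the whitespace between groups is dropped. The sketch would split in the wrong place if a bracketed tag itself contained ". " followed by a letter, so treat the character class as something to adapt to the real data.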
