Reputation: 1
I need to safely split a Java string into words and punctuation.
I have tried this code, but the problem is that it doesn't separate brackets correctly.
String sentenceString = "Hello from the outside(outside).";
sentenceString.split("(?=,|\\.|!|\\?|\\(|\\))|\\s");
Actual results are
["Hello", "from", "the", "outside", "", "(outside", ")", "."]
The expected result should be
["Hello", "from", "the", "outside", "(", "outside", ")", "."]
Upvotes: 0
Views: 94
Reputation: 18357
Instead of splitting, try matching with a regex to get your desired output. Try this regex in Java:
[a-zA-Z]+|\\p{Punct}
Here the [a-zA-Z]+ part matches one or more ASCII letters, and the \\p{Punct} part matches any single punctuation character. If you're familiar with POSIX notation, it is equivalent to [[:punct:]]. People applying a similar solution in languages or tools that support POSIX character classes can use the regex [a-zA-Z]+|[[:punct:]].
Java code:
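import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> list = new ArrayList<>();
String s = "Hello from the outside(outside).";
// Match either a run of letters or a single punctuation character
Pattern p = Pattern.compile("[a-zA-Z]+|\\p{Punct}");
Matcher m = p.matcher(s);
while (m.find()) {
    list.add(m.group()); // each match is one token
}
System.out.println(list);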
This prints the output you wanted:
[Hello, from, the, outside, (, outside, ), .]
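To see how the same pattern behaves on other inputs, here is a minimal sketch that wraps the matching loop in a helper. The class name AsciiTokenizer and the method name tokenize are made up for illustration and are not part of any library. It shows two things worth knowing: every punctuation character becomes its own token, and characters that match neither alternative (digits, whitespace) are silently skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class AsciiTokenizer {
    // Same pattern as above: a run of ASCII letters, or one ASCII punctuation character.
    static final Pattern TOKEN = Pattern.compile("[a-zA-Z]+|\\p{Punct}");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // [Hello, from, the, outside, (, outside, ), .]
        System.out.println(tokenize("Hello from the outside(outside)."));
        // Each punctuation character is a separate token: [Wait, ., ., ., what, ?, !]
        System.out.println(tokenize("Wait... what?!"));
        // Digits match neither alternative, so "42" is dropped: [Room, ,, please]
        System.out.println(tokenize("Room 42, please"));
    }
}
```

If digits should be kept as part of words, one option would be to widen the first alternative, for example to [a-zA-Z0-9]+, but that depends on what you count as a word.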
Edit: Thanks to Andreas for his nice suggestion. If you want to include letters from other languages, not just English, it is better to use this regex:
\\p{L}+|\\p{P}
Here \\p{L} covers not only English letters but the letters of any language represented in Unicode.
Note that this may come at a slightly increased cost in performance, because the engine now has to test against the full set of Unicode letters rather than just [a-zA-Z]. There is a small trade-off, so use whichever suits your needs better.
Thanks again, Andreas, for your valuable suggestion.
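The difference between the two patterns can be sketched as follows. The class name UnicodeTokenizer and its members are hypothetical, chosen only for this example, and the accented characters are written as Unicode escapes so the source stays ASCII. One caveat that is easy to miss: \\p{P} is the Unicode punctuation category and does not include symbol characters such as +, $ or =, whereas Java's default \\p{Punct} does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class, named here only for illustration.
class UnicodeTokenizer {
    static final Pattern ASCII = Pattern.compile("[a-zA-Z]+|\\p{Punct}");
    static final Pattern UNICODE = Pattern.compile("\\p{L}+|\\p{P}");

    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "caf\u00E9 (na\u00EFve)!"; // "café (naïve)!"

        // ASCII pattern skips the accented letters and breaks the words:
        // [caf, (, na, ve, ), !]
        System.out.println(tokenize(ASCII, s));

        // Unicode pattern keeps the words whole: [café, (, naïve, ), !]
        System.out.println(tokenize(UNICODE, s));

        // '+' is a math symbol, not punctuation, in Unicode terms.
        System.out.println(tokenize(ASCII, "a+b"));   // [a, +, b]
        System.out.println(tokenize(UNICODE, "a+b")); // [a, b]
    }
}
```

If you need both Unicode letters and the ASCII symbol characters, combining the alternatives as \\p{L}+|\\p{Punct} is one possibility, at the cost of not matching non-ASCII punctuation.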
Upvotes: 2