Alex Samutin

Reputation: 1

Safe split by punctuation

I need to safely split a Java string into words and punctuation marks.

I have tried the code below, but it doesn't separate the brackets correctly.

String sentenceString = "Hello from the outside(outside).";
sentenceString.split("(?=,|\\.|!|\\?|\\(|\\))|\\s");

The actual result is

["Hello", "from", "the", "outside", "", "(outside", ")", "."]

The expected result should be

["Hello", "from", "the", "outside", "(", "outside", ")", "."]

Upvotes: 0

Views: 94

Answers (1)

Pushpesh Kumar Rajwanshi

Reputation: 18357

Instead of splitting, match the tokens you want with a regex. In Java, try this one (written as it appears inside a Java string literal):

[a-zA-Z]+|\\p{Punct}

Here, [a-zA-Z]+ matches one or more ASCII letters and \\p{Punct} matches any single punctuation character. If you're familiar with POSIX character classes, \\p{Punct} is equivalent to [[:punct:]], so in languages or tools that support POSIX classes the same approach works with [a-zA-Z]+|[[:punct:]].

Java code:

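import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> list = new ArrayList<>();
String s = "Hello from the outside(outside).";
// A token is either a run of ASCII letters or a single punctuation character
Pattern p = Pattern.compile("[a-zA-Z]+|\\p{Punct}");
Matcher m = p.matcher(s);
// Collect every match in order; whitespace matches neither alternative and is skipped
while (m.find()) {
    list.add(m.group());
}
System.out.println(list);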

This prints the output you wanted:

[Hello, from, the, outside, (, outside, ), .]
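For readers who would rather keep split: the pattern in the question only looks ahead for punctuation, so nothing separates "(" from the word that follows it. Below is a minimal sketch of one possible fix, assuming a lookbehind is added and boundaries next to whitespace are excluded. The class name SplitSketch and the method splitTokens are illustrative names, not from the question.

```java
import java.util.Arrays;
import java.util.List;

class SplitSketch {
    // Split on whitespace, or at a zero-width boundary next to punctuation.
    // The \S checks keep a boundary from firing right beside whitespace,
    // which would otherwise produce empty strings in the result.
    static final String BOUNDARY =
        "\\s+"
        + "|(?<=\\S)(?=\\p{Punct})"   // just before a punctuation character
        + "|(?<=\\p{Punct})(?=\\S)";  // just after a punctuation character

    // Hypothetical helper: returns words and punctuation as separate tokens.
    static List<String> splitTokens(String text) {
        // trim() avoids a leading empty string when the text starts with whitespace
        return Arrays.asList(text.trim().split(BOUNDARY));
    }

    public static void main(String[] args) {
        // Both lines print [Hello, from, the, outside, (, outside, ), .]
        System.out.println(splitTokens("Hello from the outside(outside)."));
        System.out.println(splitTokens("Hello from the outside (outside)."));
    }
}
```

With a plain lookahead and no \S guard, an empty string shows up whenever whitespace sits directly before a punctuation mark, because the whitespace split and the zero-width split land on the same position. The matching approach above avoids that class of problem entirely, which is one reason to prefer it.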

Edit: Thanks to Andreas for his nice suggestion. If you want to match letters from other languages as well, not only English, it is better to use this regex:

\\p{L}+|\\p{P}

\\p{L} covers not only English letters but the letters of any language represented in Unicode, and \\p{P} likewise covers Unicode punctuation.

Note that this may come at a small performance cost, because the regex now has to test each character against Unicode categories rather than just [a-zA-Z]. It is a trade-off, so use whichever pattern suits your needs better.
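To make the difference concrete, here is a small sketch that runs both patterns over the same input. The class name UnicodeTokens, the tokenize helper, and the sample strings are assumptions made for illustration. The accented letters are written as \u escapes only to sidestep source-encoding issues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeTokens {
    static final Pattern ASCII = Pattern.compile("[a-zA-Z]+|\\p{Punct}");
    static final Pattern UNICODE = Pattern.compile("\\p{L}+|\\p{P}");

    // Hypothetical helper: collects every match of the pattern, in order.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "Caf\u00e9 (na\u00efve)!";
        // The ASCII pattern breaks words at accented letters:
        System.out.println(tokenize(ASCII, s));   // [Caf, (, na, ve, ), !]
        // The Unicode pattern keeps them whole:
        System.out.println(tokenize(UNICODE, s));
        // '+' is ASCII punctuation for \p{Punct}, but a Unicode symbol, not \p{P}:
        System.out.println(tokenize(ASCII, "a+b"));   // [a, +, b]
        System.out.println(tokenize(UNICODE, "a+b")); // [a, b]
    }
}
```

One thing to watch when switching: the two punctuation classes are not interchangeable. By default, Java's \\p{Punct} covers the ASCII set, including characters such as +, $, < and =, which Unicode classifies as symbols rather than punctuation, so \\p{P} silently skips them. If those characters matter, adding \\p{S} as another alternative is one option.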

Thanks again Andreas for your valuable suggestion.

Upvotes: 2
