Reputation: 185
I need to split a string (in Java) with punctuation marks being stored in the same array as words:
String sentence = "In the preceding examples, classes derived from...";
String[] split = sentence.split(" ");
I need the split array to be:
split[0] - "In"
split[1] - "the"
split[2] - "preceding"
split[3] - "examples"
split[4] - ","
split[5] - "classes"
split[6] - "derived"
split[7] - "from"
split[8] - "..."
Is there any elegant solution?
Upvotes: 7
Views: 3343
Reputation: 425198
You need lookarounds:
String[] split = sentence.split(" ?(?<!\\G)((?<=[^\\p{Punct}])(?=\\p{Punct})|\\b) ?");
Lookarounds assert a condition but (importantly here) don't consume the input when matching.
Some test code:
String sentence = "Foo bar, baz! Who? Me...";
String[] split = sentence.split(" ?(?<!\\G)((?<=[^\\p{Punct}])(?=\\p{Punct})|\\b) ?");
Arrays.stream(split).forEach(System.out::println);
Output:
Foo
bar
,
baz
!
Who
?
Me
...
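To make "don't consume the input" concrete, here is a minimal sketch comparing a split on the comma itself with a split on lookarounds around it. The class and method names are made up for illustration and are not part of the answer above.

```java
import java.util.Arrays;

class LookaroundDemo {
    // Splits on the comma itself: the comma is consumed and dropped.
    static String[] splitConsuming(String s) {
        return s.split(",");
    }

    // Splits at the empty positions before and after a comma: nothing is
    // consumed, so the comma survives as its own element.
    static String[] splitKeeping(String s) {
        return s.split("(?=,)|(?<=,)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitConsuming("a,b"))); // [a, b]
        System.out.println(Arrays.toString(splitKeeping("a,b")));   // [a, ,, b]
    }
}
```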
Upvotes: 2
Reputation: 124265
The easiest and probably cleanest way to achieve what you want is to focus on finding the data you want in the array, rather than finding the places to split your text on.
I am saying this because split introduces a lot of problems. For instance:
split(" +|(?=\\p{Punct})");
will split only on spaces and before a punctuation character, which means that text like "abc" def will be split into three elements:
"abc
"
def
As you can see, it doesn't split after the " in "abc.
The previous problem can be solved easily by adding another |(?<=\\p{Punct}) condition, as in split(" +|(?=\\p{Punct})|(?<=\\p{Punct})"), but that still doesn't solve all of your problems because of the ... token. We need to figure out a way to prevent splitting between these dots: .|.|.
One option is excluding . from \p{Punct} and handling it separately, but this would make the regex quite complex.
Another option is replacing ... with some unique string, adding this string to the split logic, and finally replacing it back to ... in the result array. But this approach requires knowing a string that can never appear in your text, so it would have to be generated each time text is parsed.
Another possible problem is that before Java 8, split produces an empty element at the start of the array when the first character of the string is a punctuation character such as ". So in Java 7 the string "foo" bar split on (?=\p{Punct}) will result in the elements [ , "foo, " bar]. To avoid this problem you would need to add something like (?!^) to prevent splitting at the start of the string.
Anyway, these solutions look overly complex.
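The split pitfalls can be reproduced with a short sketch. The class and method names are hypothetical, and the results in the comments assume Java 8 or later, where a zero-width match at the start of the string does not produce a leading empty element.

```java
import java.util.Arrays;

class SplitPitfalls {
    // Splits on spaces and before punctuation only.
    static String[] beforeOnly(String s) {
        return s.split(" +|(?=\\p{Punct})");
    }

    // Also splits after punctuation, which breaks "..." into single dots.
    static String[] beforeAndAfter(String s) {
        return s.split(" +|(?=\\p{Punct})|(?<=\\p{Punct})");
    }

    // (?!^) forbids a split at the very start of the string.
    static String[] notAtStart(String s) {
        return s.split("(?!^)(?=\\p{Punct})");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(beforeOnly("\"abc\" def")));  // ["abc, ", def]
        System.out.println(Arrays.toString(beforeAndAfter("from...")));  // [from, ., ., .]
        System.out.println(Arrays.toString(notAtStart("\"foo\" bar")));  // ["foo, " bar]
    }
}
```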
So instead of the split method, consider using the find method of the Matcher class and focus on what you want to have in the result array.
Try using a pattern like this one: [.]{3}|\p{Punct}|[\S&&\P{Punct}]+
[.]{3} will match ...
\p{Punct} will match a single punctuation character, which according to the documentation is one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[\S&&\P{Punct}]+ will match one or more characters which are both \S (not whitespace) and (&&) \P{Punct} (not punctuation); \P{foo} is the negation of \p{foo}.
Demo:
String sentence = "In (the) preceding examples, classes derived from...";
Pattern p = Pattern.compile("[.]{3}|\\p{Punct}|[\\S&&\\P{Punct}]+");
Matcher m = p.matcher(sentence);
while (m.find()) {
    System.out.println(m.group());
}
Output:
In
(
the
)
preceding
examples
,
classes
derived
from
...
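Since the question asks for an array, the same pattern can be wrapped in a small helper. This is only a sketch: the class name Tokenizer and the method name tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokenizer {
    // Tried left to right: a literal "...", then a single punctuation
    // character, then a run of non-whitespace, non-punctuation characters.
    private static final Pattern TOKEN =
            Pattern.compile("[.]{3}|\\p{Punct}|[\\S&&\\P{Punct}]+");

    static String[] tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize("In the preceding examples, classes derived from...")) {
            System.out.println(token);
        }
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call.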
Upvotes: 1
Reputation: 10925
You may try by replacing triple dots with ellipsis character first:
String sentence = "In the preceding examples, classes derived from...";
String[] split = sentence.replace("...", "…").split(" +|(?=,|\\p{Punct}|…)");
Afterwards you can leave it as it is, or convert it back by running replace("…", "...") on each element of the array.
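A minimal sketch of the whole round trip, including the conversion back. The class and method names are hypothetical, and the sketch assumes the input contains no real ellipsis characters, since those would also come back as three dots.

```java
class EllipsisSplit {
    static String[] split(String sentence) {
        // \u2026 is the single ellipsis character; it is not part of
        // \p{Punct}, so it is never split off by the punctuation lookahead.
        String[] parts = sentence.replace("...", "\u2026")
                .split(" +|(?=,|\\p{Punct}|\u2026)");
        // Convert each element back to three plain dots.
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace("\u2026", "...");
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("In the preceding examples, classes derived from...")) {
            System.out.println(part);
        }
    }
}
```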
Upvotes: 1
Reputation: 3495
Here is another example. This solution probably works for all combinations.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class App {
    public static void main(String[] args) {
        String sentence = "In the preceding examples, classes derived from...";
        List<String> list = splitWithPunctuation(sentence);
        System.out.println(list);
    }

    public static List<String> splitWithPunctuation(String sentence) {
        // A run of characters that are neither letters, digits, nor whitespace.
        Pattern p = Pattern.compile("([^a-zA-Z\\d\\s]+)");
        String[] split = sentence.split(" ");
        List<String> list = new ArrayList<>();
        for (String s : split) {
            Matcher matcher = p.matcher(s);
            boolean found = false;
            int i = 0;
            while (matcher.find()) {
                found = true;
                // Skip the empty prefix when the token starts with punctuation, e.g. "(the)".
                if (matcher.start() > i)
                    list.add(s.substring(i, matcher.start()));
                list.add(s.substring(matcher.start(), matcher.end()));
                i = matcher.end();
            }
            if (found) {
                // Keep whatever follows the last punctuation run.
                if (i < s.length())
                    list.add(s.substring(i, s.length()));
            } else
                list.add(s);
        }
        return list;
    }
}
Output:
In
the
preceding
examples
,
classes
derived
from
...
A more complex example:
String sentence = "In the preced^^^in## examp!les, classes derived from...";
List<String> list = splitWithPunctuation(sentence);
System.out.println(list);
Output:
In
the
preced
^^^
in
##
examp
!
les
,
classes
derived
from
...
Upvotes: 0
Reputation: 2308
For your particular case the two main challenges are the ordering (e.g. punctuation first and then the word, or the other way around) and the ... punctuation.
The rest you can easily implement using \p{Punct}, like this:
Pattern.compile("\\p{Punct}");
Regarding the two challenges mentioned:
1. Ordering: You can try the following:
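import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctuationSplit {
    private static final Pattern punctuation = Pattern.compile("\\p{Punct}");
    private static final Pattern word = Pattern.compile("\\w+");

    public static void main(String[] args) {
        String sentence = "In the preceding examples, classes derived from...";
        String[] split = sentence.split(" ");
        List<String> result = new LinkedList<>();
        for (String s : split) {
            List<String> withMarks = splitWithPunctuationMarks(s);
            result.addAll(withMarks);
        }
        System.out.println(result);
    }

    private static List<String> splitWithPunctuationMarks(String s) {
        // Key every match by its start index; the TreeMap keeps them in text order.
        Map<Integer, String> positionToString = new TreeMap<>();
        Matcher punctMatcher = punctuation.matcher(s);
        while (punctMatcher.find()) {
            positionToString.put(punctMatcher.start(), punctMatcher.group());
        }
        Matcher wordMatcher = word.matcher(s);
        while (wordMatcher.find()) {
            positionToString.put(wordMatcher.start(), wordMatcher.group());
        }
        // The values are the words and the single punctuation characters, in order.
        return new ArrayList<>(positionToString.values());
    }
}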
2. The ... punctuation: You can try to look back for previous occurrences of the . character at (currentIndex - 1) every time you find it.
Upvotes: 0
Reputation: 3137
I believe this method will do what you want:
public static List<String> split(String str) {
    Pattern pattern = Pattern.compile("(\\w+)|(\\.{3})|[^\\s]");
    Matcher matcher = pattern.matcher(str);
    List<String> list = new ArrayList<String>();
    while (matcher.find()) {
        list.add(matcher.group());
    }
    return list;
}
It will split a string into runs of word characters, the ... ellipsis, and any other single non-whitespace character.
For this example
"In the preceding examples, classes.. derived from... Hello, World! foo!bar"
The list will be
[0] In
[1] the
[2] preceding
[3] examples
[4] ,
[5] classes
[6] .
[7] .
[8] derived
[9] from
[10] ...
[11] Hello
[12] ,
[13] World
[14] !
[15] foo
[16] !
[17] bar
Upvotes: 1
Reputation: 208
You could sanitize the string by replacing, say, "," with " ,", and so on for all the punctuation marks you care to distinguish.
In the particular case of "..." you can do:
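// there can be a series of dots, so put one space before each whole run
// (chaining replace(".", " .").replace(". .", "..") would turn "..." into " .. .")
sentence = sentence.replaceAll("\\.+", " $0");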
Then you split.
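A sketch of the sanitize-then-split idea as one helper. It swaps the per-mark replace calls for a single replaceAll with a regex, which is an assumption of this example rather than the code above, and the class and method names are hypothetical. Runs of dots are padded as a whole, and every other punctuation character is padded individually.

```java
class SanitizeSplit {
    static String[] split(String sentence) {
        // Surround each run of dots, and each other punctuation character,
        // with spaces; $0 stands for the matched text.
        String spaced = sentence.replaceAll("\\.+|[\\p{Punct}&&[^.]]", " $0 ");
        // Trim the padding at the ends, then split on runs of spaces.
        return spaced.trim().split(" +");
    }

    public static void main(String[] args) {
        for (String part : split("In the preceding examples, classes derived from...")) {
            System.out.println(part);
        }
    }
}
```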
EDIT: replaced single quotes with double quotes.
Upvotes: 0