Reputation:
I need a regex to separate the words in each line. Here is what I have:
import java.util.Arrays;
import java.util.List;

public class Example {
    public static void main(String[] args) {
        countWords(List.of("Some random stuff, another stuff, words in quotes „Example“, Oops!"));
    }

    // Splits each line on the delimiter character class and prints the resulting tokens.
    public static void countWords(List<String> lines) {
        lines.stream()
             .map(line -> line.split("[\\\\p{Punct}«»\\s\\d„“…)–]+"))
             .forEach(e -> System.out.println(Arrays.toString(e)));
    }
}
But the result is
[Some, ra, dom, s, ff,, a, o, her, s, ff,, words, i, q, o, es, Exam, le, ,, Oo, s!]
As you can see, words are split up, extra commas appear, and the exclamation mark is left in (I thought \p{Punct} includes exclamation marks).
Upvotes: 0
Views: 62
Reputation: 273275
Unescaping the Java string literal "\\\\p{Punct}", we get:
\\p{Punct}
In a character class, this is understood as a backslash character and the characters p, {, P, u, n, c, t, }, clearly not what you want.
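To see this for yourself, a minimal sketch like the one below prints what the regex engine receives and checks a few single characters against the character class. The class name EscapeCheckSketch and the helper matchesDoubleEscaped are made up for illustration.

```java
import java.util.regex.Pattern;

public class EscapeCheckSketch {
    // Character class built from the literal with four backslashes, as in the question.
    static final Pattern DOUBLE_ESCAPED = Pattern.compile("[\\\\p{Punct}]");

    // Hypothetical helper: true if the whole string is one character from the class.
    static boolean matchesDoubleEscaped(String s) {
        return DOUBLE_ESCAPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The regex text after the compiler has unescaped the string literal.
        System.out.println("[\\\\p{Punct}]");            // prints [\\p{Punct}]
        System.out.println(matchesDoubleEscaped("n"));   // true: 'n' is listed literally
        System.out.println(matchesDoubleEscaped("\\"));  // true: a single backslash
        System.out.println(matchesDoubleEscaped("!"));   // false: no punctuation class here
    }
}
```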
You have added an extra backslash in the regex. Just like \d or \s, \p{XXX} only needs one backslash as the prefix, even when it is used in a character class. So you should remove two backslashes from your Java string literal:
"[\\p{Punct}«»\\s\\d„“…)–]+"
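As a quick check, here is a small sketch that applies this literal to the sample line from the question. The class name SplitWordsSketch and the method splitWords are illustrative names only, not part of the original code.

```java
import java.util.Arrays;

public class SplitWordsSketch {
    // Delimiter class with a single escaped backslash before p{Punct}.
    static final String DELIMITERS = "[\\p{Punct}«»\\s\\d„“…)–]+";

    // Hypothetical helper: splits one line into words on runs of delimiters.
    static String[] splitWords(String line) {
        return line.split(DELIMITERS);
    }

    public static void main(String[] args) {
        String line = "Some random stuff, another stuff, words in quotes „Example“, Oops!";
        // prints [Some, random, stuff, another, stuff, words, in, quotes, Example, Oops]
        System.out.println(Arrays.toString(splitWords(line)));
    }
}
```

Note that split would still produce an empty first element if a line started with a delimiter, so you may want to filter out empty strings before counting.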
Upvotes: 2
Reputation: 6047
It is not clear what you are trying to accomplish, but the output matches your code. Your regex matches the occurrences of any of the following: \, p, {, P, u, n, c, t, }, «, », any whitespace, any digit, „, “, …, ), –.
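Thus your line is split in a lot of places. The extra commas appear because the comma is not one of the characters your regex splits on, so it stays attached to tokens such as "ff,", and toString of Arrays then separates the elements with ", ".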
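A rough sketch of that effect, using the regex exactly as written in the question on a short fragment of the input. The class name CommaCheckSketch and the helper splitLikeQuestion are invented for this illustration.

```java
import java.util.Arrays;

public class CommaCheckSketch {
    // Hypothetical helper: splits with the regex exactly as it appears in the question.
    static String[] splitLikeQuestion(String line) {
        return line.split("[\\\\p{Punct}«»\\s\\d„“…)–]+");
    }

    public static void main(String[] args) {
        String[] tokens = splitLikeQuestion("stuff, another");
        // Print each token on its own line so the token boundaries are unambiguous.
        for (String token : tokens) {
            System.out.println("<" + token + ">");
        }
        // The token "ff," keeps its comma; Arrays.toString adds ", " between elements.
        System.out.println(Arrays.toString(tokens)); // prints [s, ff,, a, o, her]
    }
}
```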
Here is a helpful resource to check what your regex is actually matching: https://regex101.com/r/bHP5lu/1
Upvotes: 0