Why does \p{Punct} regex leave commas in this example?

Question

I need a regex to separate the words in each line. Here is what I have:

    import java.util.Arrays;
    import java.util.List;

    public class Example {
        public static void main(String[] args) {
            countWords(List.of("Some random stuff, another stuff, words in quotes „Example“, Oops!"));
        }

        public static void countWords(List<String> lines) {
            // Split each line on runs of punctuation, whitespace, digits and quote characters
            lines.stream()
                    .map(line -> line.split("[\\\\p{Punct}«»\\s\\d„“…)–]+"))
                    .forEach(e -> System.out.println(Arrays.toString(e)));
        }
    }

But the result is

[Some, ra, dom, s, ff,, a, o, her, s, ff,, words, i, q, o, es, Exam, le, ,, Oo, s!]

As you can see, words are split up, and extra commas and an exclamation mark are left over (I thought \p{Punct} included exclamation marks).

Sweeper · Accepted Answer

Unescaping the Java string literal "\\\\p{Punct}", we get:

\\p{Punct}

In a character class, this is understood as an escaped (literal) backslash followed by the literal characters p, {, P, u, n, c, t, }, which is clearly not what you want.
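To see this concretely, here is a minimal sketch that tests single characters against the character class produced by the doubled backslash. The class name DoubledBackslashDemo and the helper inBuggyClass are made up for illustration. Letters such as n and t turn out to be members of the class, while the comma and the exclamation mark are not, which is consistent with the output in the question:

```java
import java.util.regex.Pattern;

class DoubledBackslashDemo {
    // The Java literal "[\\\\p{Punct}]" is the regex [\\p{Punct}]:
    // an escaped backslash, then the plain characters p { P u n c t }.
    static final Pattern BUGGY_CLASS = Pattern.compile("[\\\\p{Punct}]");

    // True if the single-character string s is a member of the buggy class.
    static boolean inBuggyClass(String s) {
        return BUGGY_CLASS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inBuggyClass("n"));   // true
        System.out.println(inBuggyClass("t"));   // true
        System.out.println(inBuggyClass("\\"));  // true (a single backslash)
        System.out.println(inBuggyClass(","));   // false
        System.out.println(inBuggyClass("!"));   // false
    }
}
```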

You have added an extra backslash in the regex. Just like \d or \s, \p{XXX} only needs one backslash as the prefix, even when it is used in a character class. So you should remove two backslashes from your Java string literal:

"[\\p{Punct}«»\\s\\d„“…)–]+"

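As a quick check, here is a small sketch of the corrected split applied to the sample line from the question. The class name FixedSplitDemo and the method splitWords are illustrative names, not part of the original code:

```java
import java.util.Arrays;

class FixedSplitDemo {
    // Java literal "\\p{Punct}" is the regex \p{Punct}, the punctuation class.
    static String[] splitWords(String line) {
        return line.split("[\\p{Punct}«»\\s\\d„“…)–]+");
    }

    public static void main(String[] args) {
        String line = "Some random stuff, another stuff, words in quotes „Example“, Oops!";
        // prints [Some, random, stuff, another, stuff, words, in, quotes, Example, Oops]
        System.out.println(Arrays.toString(splitWords(line)));
    }
}
```

By default, \p{Punct} in Java covers only ASCII punctuation, so characters such as „ and “ still need to be listed explicitly in the class, as the question already does.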