user15599360

Why does the \p{Punct} regex leave commas in this example?

I need a regex to separate the words in each line. Here is what I have:

    import java.util.Arrays;
    import java.util.List;

    public class Example {
        public static void main(String[] args) {
            countWords(List.of("Some random stuff, another stuff, words in quotes „Example“, Oops!"));
        }

        // Splits each line on the character class and prints the resulting tokens.
        public static void countWords(List<String> lines) {
            lines.stream()
                    .map(line -> line.split("[\\\\p{Punct}«»\\s\\d„“…)–]+"))
                    .forEach(e -> System.out.println(Arrays.toString(e)));
        }
    }

But the result is

[Some, ra, dom, s, ff,, a, o, her, s, ff,, words, i, q, o, es, Exam, le, ,, Oo, s!]

As you can see, words are split up, and extra commas and an exclamation mark are left over (I thought \p{Punct} includes exclamation marks).

Upvotes: 0

Views: 62

Answers (2)

Sweeper

Reputation: 273275

Unescaping the Java string literal "\\\\p{Punct}", we get:

\\p{Punct}

Inside a character class, the regex engine reads this as a literal backslash followed by the individual characters p, {, P, u, n, c, t and }, which is clearly not what you want.

You have one backslash too many in the regex. Just like \d or \s, \p{XXX} needs only a single backslash as its prefix, even inside a character class. Since each regex backslash is written as two backslashes in a Java string literal, you should remove two backslashes from the literal:

"[\\p{Punct}«»\\s\\d„“…)–]+"
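As a quick check, here is a minimal sketch that applies the corrected pattern to the line from the question. The class name `SplitFixed` and the helper `words` are made up for this illustration and are not part of the original code.

```java
import java.util.Arrays;
import java.util.List;

public class SplitFixed {
    // "\\p{Punct}" in Java source is the regex \p{Punct} (one backslash).
    static final String DELIMITERS = "[\\p{Punct}«»\\s\\d„“…)–]+";

    // Hypothetical helper: splits a line on runs of delimiter characters.
    static List<String> words(String line) {
        return Arrays.asList(line.split(DELIMITERS));
    }

    public static void main(String[] args) {
        String line = "Some random stuff, another stuff, words in quotes „Example“, Oops!";
        // prints [Some, random, stuff, another, stuff, words, in, quotes, Example, Oops]
        System.out.println(words(line));
    }
}
```

Note that `split` drops trailing empty strings, so the final `!` does not produce an extra element. A line that starts with a delimiter would still yield a leading empty string.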

Upvotes: 2

puelo

Reputation: 6047

It is not clear what you are trying to accomplish, but the output matches your code. Your regex matches one or more occurrences of any of the following characters: \, p, {, P, u, n, c, t, }, «, », any whitespace, any digit, „, “, …, ), –.

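Thus your line is split in a lot of places. The commas between elements come from how Arrays.toString formats the array; the commas attached to tokens such as "ff," are literal commas from the input, which your character class never matches.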

Here is a helpful resource to check what your regex is actually matching: https://regex101.com/r/bHP5lu/1
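The same check can also be done locally. The sketch below is only an illustration (the class `ClassProbe` and its two helpers are hypothetical names): it tests single characters against the punctuation part of the character class, once with four backslashes in the literal and once with two.

```java
import java.util.regex.Pattern;

public class ClassProbe {
    // Four backslashes: a literal backslash plus the letters and braces of "p{Punct}".
    static final Pattern BUGGY = Pattern.compile("[\\\\p{Punct}]");
    // Two backslashes: the POSIX punctuation class \p{Punct}.
    static final Pattern FIXED = Pattern.compile("[\\p{Punct}]");

    static boolean buggyMatches(String s) {
        return BUGGY.matcher(s).matches();
    }

    static boolean fixedMatches(String s) {
        return FIXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Probe a few single characters against both classes.
        for (String s : new String[] {",", "!", "n", "t", "{"}) {
            System.out.println("'" + s + "' buggy=" + buggyMatches(s) + " fixed=" + fixedMatches(s));
        }
    }
}
```

With the buggy class, `,` and `!` are not matched while `n` and `t` are, which is why the words are cut apart and the punctuation stays in the tokens.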

Upvotes: 0
