william

Reputation: 113

Regex (?U)\p{Punct} is missing some Unicode punctuation signs in Java

First of all, I want to remove all punctuation signs in a String. I wrote the following code.

Pattern pattern = Pattern.compile("\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（hello）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));

After replacement I got this output: （hello）

So the pattern matches any one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, which agrees with the official docs.

But I want to remove "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09) as well, so I changed my code to this:

Pattern pattern = Pattern.compile("(?U)\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));

After replacement, I got this output: $+<=>^`|~

It did match "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09), but it missed $+<=>^`|~.

I am so confused. Why did that happen? Can anyone give some help?

Upvotes: 6

Views: 458

Answers (1)

Sweeper

Reputation: 273275

Unicode (that is, when you use (?U)) and POSIX (when you don't use (?U)) disagree on what counts as punctuation.

When you don't use (?U), \p{Punct} matches the POSIX punctuation character class, which is just

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

When you use (?U), \p{Punct} matches the Unicode Punctuation category, which does not include some of the characters in the above list, namely:

$+<=>^`|~

For example, the Unicode general category of $ is "Symbol, Currency" (Sc), which is not one of the punctuation (P) categories.
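One way to check this yourself is to ask Character.getType for each character's general category. The following is only a minimal sketch: the class name PunctCategories and its helper methods are made up for illustration, and it assumes that "Unicode punctuation" means the seven P* general categories.

```java
public class PunctCategories {
    // The 32 ASCII characters matched by the POSIX \p{Punct} class
    static final String ASCII_PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // Hypothetical helper: true if the code point's general category is one of Pc, Pd, Ps, Pe, Pi, Pf, Po
    static boolean isUnicodePunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:     // Pc
            case Character.DASH_PUNCTUATION:          // Pd
            case Character.START_PUNCTUATION:         // Ps
            case Character.END_PUNCTUATION:           // Pe
            case Character.INITIAL_QUOTE_PUNCTUATION: // Pi
            case Character.FINAL_QUOTE_PUNCTUATION:   // Pf
            case Character.OTHER_PUNCTUATION:         // Po
                return true;
            default:
                return false;
        }
    }

    // Hypothetical helper: keeps only the characters that are NOT Unicode punctuation
    static String nonPunctuation(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
         .filter(cp -> !isUnicodePunctuation(cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the ASCII "punctuation" characters that Unicode classifies as symbols
        System.out.println(nonPunctuation(ASCII_PUNCT));
    }
}
```

Run on the POSIX list, this should leave exactly the characters that (?U)\p{Punct} skips: they are all in the symbol categories (Sc, Sm, Sk) rather than in P.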

If you want to match $+<=>^`|~ plus all the Unicode punctuation characters, you can put both in a single character class. You can also use the Unicode category "P" directly, rather than turning on Unicode mode with (?U).

Pattern pattern = Pattern.compile("[\\p{P}$+<=>^`|~]");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
// you don't need "find" first
System.out.println(matcher.replaceAll(""));
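To see the three patterns side by side, here is a small comparison sketch. The class name PunctCompare and the strip helper are hypothetical, and the fullwidth parentheses are written as \uFF08 and \uFF09 escapes so the example does not depend on the source file's encoding.

```java
import java.util.regex.Pattern;

public class PunctCompare {
    // All ASCII punctuation followed by "hello" wrapped in fullwidth parentheses
    static final String INPUT =
            "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\uFF08hello\uFF09";

    // Hypothetical helper: removes every match of the regex from the input
    static String strip(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // POSIX class: removes ASCII punctuation only, fullwidth parentheses survive
        System.out.println(strip("\\p{Punct}", INPUT));
        // Unicode mode: removes category P only, the ASCII symbols survive
        System.out.println(strip("(?U)\\p{Punct}", INPUT));
        // Category P plus the nine ASCII symbols: only "hello" is left
        System.out.println(strip("[\\p{P}$+<=>^`|~]", INPUT));
    }
}
```

The (?U) prefix is the embedded form of the Pattern.UNICODE_CHARACTER_CLASS flag, so passing that flag to Pattern.compile should behave the same way as the second call.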

Upvotes: 9
