Reputation: 113
First of all, I want to remove all punctuation characters from a String. I wrote the following code:
Pattern pattern = Pattern.compile("\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（hello）");
if (matcher.find())
System.out.println(matcher.replaceAll(""));
After replacement I got this output: （hello）.
So the pattern matches any one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, which agrees with the official docs.
But I want to remove "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09) as well, so I changed my code to this:
Pattern pattern = Pattern.compile("(?U)\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
if (matcher.find())
System.out.println(matcher.replaceAll(""));
After replacement, I got this output: $+<=>^`|~
It did match "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09), but it missed $+<=>^`|~.
I am so confused. Why did that happen? Can anyone give some help?
Upvotes: 6
Views: 458
Reputation: 273275
Unicode (that is, when you use (?U)) and POSIX (when you don't use (?U)) disagree on what counts as punctuation.
When you don't use (?U), \p{Punct} matches the POSIX punctuation character class, which is just
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
When you use (?U), \p{Punct} matches the Unicode Punctuation category, which does not include some of the characters in the above list, namely:
$+<=>^`|~
For example, the Unicode category for $ is "Symbol, Currency", or Sc, which is a symbol category rather than a punctuation one.
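If you want to check the categories yourself, here is a minimal sketch using Character.getType. The class name CategoryCheck and the helper isUnicodePunctuation are made up for illustration; the helper assumes that "Unicode punctuation" means the seven P* general categories, which is how I read the docs for (?U)\p{Punct}.

```java
public class CategoryCheck {
    // Hypothetical helper: true when the code point belongs to one of the
    // Unicode "P" (punctuation) general categories: Pc, Pd, Ps, Pe, Pi, Pf, Po.
    public static boolean isUnicodePunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // The nine ASCII symbols from above, then '!' and the two fullwidth
        // parentheses (written as escapes to avoid source-encoding issues).
        String sample = "$+<=>^`|~!\uFF08\uFF09";
        sample.codePoints().forEach(cp ->
            System.out.printf("U+%04X punctuation=%b%n", cp, isUnicodePunctuation(cp)));
    }
}
```

The nine ASCII symbols report false (they are in the S* symbol categories), while '!' and the fullwidth parentheses report true.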
If you want to match $+<=>^`|~ plus all the Unicode punctuation characters, you can put both in a character class. You can also use the Unicode category "P" directly, rather than turning on Unicode mode with (?U).
Pattern pattern = Pattern.compile("[\\p{P}$+<=>^`|~]");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
// you don't need "find" first
System.out.println(matcher.replaceAll(""));
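For comparison, here is a self-contained sketch that runs all three patterns from this thread on the same input. The class name PunctStrip and the strip helper are only for illustration, and the fullwidth parentheses are written as \uFF08 and \uFF09 escapes so the file does not depend on the source encoding.

```java
import java.util.regex.Pattern;

public class PunctStrip {
    // Default mode: \p{Punct} is the POSIX class, i.e. ASCII punctuation only.
    public static final Pattern POSIX = Pattern.compile("\\p{Punct}");
    // (?U) switches \p{Punct} to the Unicode punctuation categories.
    public static final Pattern UNICODE = Pattern.compile("(?U)\\p{Punct}");
    // Unicode category P plus the nine ASCII symbols it does not cover.
    public static final Pattern COMBINED = Pattern.compile("[\\p{P}$+<=>^`|~]");

    // Removes every match of the given pattern from the input.
    public static String strip(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\uFF08hello\uFF09";
        System.out.println("POSIX:    " + strip(POSIX, input));
        System.out.println("UNICODE:  " + strip(UNICODE, input));
        System.out.println("COMBINED: " + strip(COMBINED, input));
    }
}
```

The POSIX pattern leaves the fullwidth parentheses around hello, the (?U) pattern leaves $+<=>^`|~hello, and the combined class leaves just hello.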
Upvotes: 9