user833970

Reputation: 2799

The regular expression \p{Punct} misses Unicode punctuation in Java

I wrote a little test to demonstrate:

@Test
public void missingPunctuationRegex() {
    Pattern punct = Pattern.compile("[\\p{Punct}]");

    // ASCII apostrophe (U+0027): matches
    Matcher m = punct.matcher("'");
    assertTrue("ascii punctuation", m.find());

    // LEFT SINGLE QUOTATION MARK (U+2018): does not match
    m = punct.matcher("‘");
    assertTrue("unicode punctuation", m.find());
}

The first assert passes and the second one fails. You may have to squint to see it, but the second character is 'LEFT SINGLE QUOTATION MARK' (U+2018), which should count as punctuation as far as I can tell.

How would I match ALL punctuation in Java regular expressions?

Upvotes: 6

Views: 5406

Answers (2)

Joni

Reputation: 111349

You can compile the pattern with the Pattern.UNICODE_CHARACTER_CLASS flag to make \p{Punct} match Unicode punctuation instead of only the ASCII set.
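A minimal sketch of what that looks like, assuming Java 7 or later (where the flag was introduced). The class and method names are made up for illustration, and U+2018 is written as an escape so the example does not depend on source file encoding:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class UnicodePunctFlagDemo {
    // Default behavior: \p{Punct} covers only the ASCII punctuation characters.
    static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");

    // With UNICODE_CHARACTER_CLASS, \p{Punct} follows the Unicode Punctuation property.
    static final Pattern UNICODE_PUNCT =
            Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean containsAsciiPunct(String s) {
        return ASCII_PUNCT.matcher(s).find();
    }

    static boolean containsUnicodePunct(String s) {
        return UNICODE_PUNCT.matcher(s).find();
    }

    public static void main(String[] args) {
        String leftQuote = "\u2018"; // LEFT SINGLE QUOTATION MARK
        System.out.println(containsAsciiPunct(leftQuote));   // false
        System.out.println(containsUnicodePunct(leftQuote)); // true
        System.out.println(containsUnicodePunct("'"));       // true
    }
}
```

The same behavior can be switched on inside the pattern with the embedded flag (?U). Keep in mind that the flag also changes the other predefined and POSIX classes (\w, \d, \s and so on) to their Unicode versions, and it is worth checking whether ASCII symbols such as $ or + still match under the flag, since Unicode classifies those as symbols rather than punctuation.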

Upvotes: 8

Sotirios Delimanolis

Reputation: 280132

The Javadoc of Pattern states:

\p{Punct} Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

U+2018 is not in that list, so \p{Punct} does not cover it by default and you'll have to match it explicitly:

Pattern punct = Pattern.compile("[\\p{Punct}‘]");
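Listing characters one by one gets tedious if more than a handful are needed. As a sketch of one alternative (not something the Javadoc quote above prescribes), the character class can combine the ASCII \p{Punct} set with the Unicode general category \p{P}, which needs no flag. The class and method names below are hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class AnyPunctDemo {
    // Union of the ASCII \p{Punct} set and the Unicode general category P
    // (all punctuation subcategories, including initial and final quotes).
    static final Pattern ANY_PUNCT = Pattern.compile("[\\p{Punct}\\p{P}]");

    static boolean containsPunct(String s) {
        return ANY_PUNCT.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsPunct("'"));      // ASCII apostrophe
        System.out.println(containsPunct("\u2018")); // LEFT SINGLE QUOTATION MARK
        System.out.println(containsPunct("$"));      // ASCII symbol, part of \p{Punct}
        System.out.println(containsPunct("abc"));    // no punctuation
    }
}
```

Because the default \p{Punct} is kept in the union, ASCII characters such as $ and + continue to match alongside the Unicode punctuation.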

Upvotes: 2
