centic

Reputation: 15882

Regular expression to match all content, including Unicode punctuation marks

I am trying to match text that may contain Unicode characters, including special punctuation marks such as \u0085 (as written in Java).

When I do something like

Matcher testMatcher = Pattern.compile("(.+)", Pattern.UNICODE_CHARACTER_CLASS).matcher("test text up \u0085 after");

The match I get is "test text up", without the punctuation mark, but I would like to match all of the content. How do I do this?

See also a demonstration in the regex101 tool.

Update: I did try ((?:\P{M}\p{M}*+)+) as discussed at regular-expressions.info, but it does not seem to work in Java.

Upvotes: 3

Views: 635

Answers (1)

Wiktor Stribiżew

Reputation: 627219

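The character U+0085 is not punctuation: it belongs to the Cc (Other, control) general category.

Java's regex engine treats \u0085 (NEXT LINE) as a line terminator, and by default . does not match line terminators. You need to add the Pattern.DOTALL modifier to match it, or prepend (?s) to the pattern.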

General category:           Cc - Other, control
Canonical combining class:  0 - Spacing, split, enclosing, reordrant, & Tibetan subjoined
Bidirectional category:     B - Paragraph separator
Unicode 1.0 name:           NEXT LINE (NEL)
Unicode version:            1.1
As text:
Decimal:                    133
HTML escape:                &#133;
URL escape:                 %C2%85

See details here
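
As a quick cross-check of the category listed above, the following sketch asks Java itself how it classifies the character. The class name NelCategoryCheck and its two helper methods are made up for this illustration; Character.getType and the \p{Cc} and \p{P} property classes are standard Java.

```java
public class NelCategoryCheck {

    // Hypothetical helper: true if the character is in general category Cc (control).
    static boolean isControl(char c) {
        return Character.getType(c) == Character.CONTROL;
    }

    // Hypothetical helper: true if the character is in any punctuation category (P*).
    static boolean isPunctuation(char c) {
        return String.valueOf(c).matches("\\p{P}");
    }

    public static void main(String[] args) {
        char nel = '\u0085';
        System.out.println(isControl(nel));                        // true
        System.out.println(String.valueOf(nel).matches("\\p{Cc}")); // true
        System.out.println(isPunctuation(nel));                    // false
    }
}
```

So the character is a control character rather than a punctuation mark, which is why punctuation-oriented patterns do not help here.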

And here is an IDEONE demo

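// DOTALL lets . match line terminators, including U+0085 (NEL)
Matcher testMatcher = Pattern.compile(".+", Pattern.DOTALL | Pattern.UNICODE_CHARACTER_CLASS).matcher("test text up \u0085 after");
if (testMatcher.find()) {
    System.out.println(testMatcher.group(0));
} // => the whole input: "test text up", the (invisible) NEL character, then "after"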
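
For comparison, here is a minimal sketch that puts the default behaviour next to the two fixes mentioned above (Pattern.DOTALL and the inline (?s) flag). The class name NelMatchDemo and the helper firstMatch are hypothetical and exist only for this illustration; match lengths are printed because the NEL character itself is invisible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NelMatchDemo {

    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "test text up \u0085 after";

        // Default: . stops at the line terminator U+0085.
        System.out.println(firstMatch(".+", 0, input).length());              // 13

        // Pattern.DOTALL: . also matches line terminators.
        System.out.println(firstMatch(".+", Pattern.DOTALL, input).length()); // 20

        // Inline flag (?s): same effect as Pattern.DOTALL.
        System.out.println(firstMatch("(?s).+", 0, input).length());          // 20
    }
}
```

Note that the default match still includes the trailing space before the NEL character, so it is 13 characters long rather than 12.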

Upvotes: 3
