Reputation: 15882
I am trying to match some text which may include Unicode characters, including special punctuation marks (like \u0085 in Java).
When I do something like
Matcher testMatcher = Pattern.compile("(.+)", Pattern.UNICODE_CHARACTER_CLASS).matcher("test text up \u0085 after");
I get a match of "test text up", without the punctuation mark or anything after it. However, I would like to match all content. How do I do this?
See also a demonstration in the regex101 tool.
Update: I did try ((?:\P{M}\p{M}*+)+) as discussed at regular-expressions.info, but it does not seem to work in Java.
Upvotes: 3
Views: 635
Reputation: 627219
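The \u0085 character belongs to the Cc (Other, control) general category. It is NEXT LINE (NEL), which Java's Pattern treats as a line terminator, so . does not match it by default.
You need to add the Pattern.DOTALL modifier to match it, or prepend (?s) to the pattern.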
General category: Cc - Other, control
Canonical combining class: 0 - Spacing, split, enclosing, reordrant, & Tibetan subjoined
Bidirectional category: B - Paragraph separator
Unicode 1.0 name: NEXT LINE (NEL)
Unicode version: 1.1
As text:
Decimal: 133
HTML escape: &#133;
URL escape: %C2%85
See details here
And here is an IDEONE demo
Matcher testMatcher = Pattern.compile(".+", Pattern.DOTALL | Pattern.UNICODE_CHARACTER_CLASS).matcher("test text up \u0085 after");
if (testMatcher.find()) {
    System.out.println(testMatcher.group(0));
} // => the whole input, "test text up \u0085 after" (the NEL character is invisible when printed)
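For comparison, here is a minimal self-contained sketch that runs the question's pattern three ways: without DOTALL, with Pattern.DOTALL, and with the inline (?s) flag. The class name NelDotDemo and the helper firstMatch are made up for this illustration and are not part of the original demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NelDotDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "test text up \u0085 after";

        // Without DOTALL, '.' stops at the NEL line terminator.
        String plain = firstMatch("(.+)", Pattern.UNICODE_CHARACTER_CLASS, input);

        // With DOTALL, '.' also matches line terminators, including NEL.
        String dotall = firstMatch("(.+)", Pattern.DOTALL | Pattern.UNICODE_CHARACTER_CLASS, input);

        // The inline (?s) flag turns on DOTALL from inside the pattern.
        String inline = firstMatch("(?s)(.+)", Pattern.UNICODE_CHARACTER_CLASS, input);

        System.out.println(plain.length());        // 13 ("test text up " only)
        System.out.println(dotall.equals(input));  // true
        System.out.println(inline.equals(input));  // true
    }
}
```

Comparing lengths and equality, rather than printing the matches, makes the difference visible even though the NEL character itself does not render.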
Upvotes: 3