Reputation: 489
I have a regular expression that validates letters, digits, _ and -. I now need to support multibyte characters as well, so I added the UNICODE_CHARACTER_CLASS flag, but the pattern is not matching. Can someone enlighten me on this?
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test123 {
    public static void main(String[] args) {
        String test = "熏肉еконcarácterbañlácaractères";
        // Explicit ranges such as a-z stay ASCII-only, even with the flag set
        Pattern pattern = Pattern.compile("^[a-zA-Z0-9_-]*$", Pattern.UNICODE_CHARACTER_CLASS);
        Matcher matcher = pattern.matcher(test);
        if (matcher.matches()) {
            System.out.println("matched");
        } else {
            System.out.println("not matched");
        }
    }
}
Upvotes: 2
Views: 2516
Reputation: 124275
The problem is that, despite that flag, a-z doesn't represent "all Unicode alphabetic characters" but only "characters between a and z".
The UNICODE_CHARACTER_CLASS flag adds Unicode support only to predefined character classes like \w, which normally represents a-zA-Z0-9_.
So try with
Pattern.compile("^[\\w-]*$",Pattern.UNICODE_CHARACTER_CLASS);
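To make the difference concrete, here is a minimal sketch that runs the question's string through three patterns: the original explicit ranges with the flag, \w with the flag, and \w without it. The class name UnicodeFlagDemo and the constant names are made up for illustration.

```java
import java.util.regex.Pattern;

class UnicodeFlagDemo {
    // Explicit ASCII ranges: the flag does not widen a-z, A-Z or 0-9.
    static final Pattern ASCII_RANGES =
            Pattern.compile("^[a-zA-Z0-9_-]*$", Pattern.UNICODE_CHARACTER_CLASS);

    // Predefined class \w: with the flag it also covers non-ASCII letters and digits.
    static final Pattern UNICODE_WORD =
            Pattern.compile("^[\\w-]*$", Pattern.UNICODE_CHARACTER_CLASS);

    // Same pattern without the flag: \w is limited to a-zA-Z0-9_.
    static final Pattern ASCII_WORD = Pattern.compile("^[\\w-]*$");

    public static void main(String[] args) {
        String test = "熏肉еконcarácterbañlácaractères";
        System.out.println(ASCII_RANGES.matcher(test).matches()); // false
        System.out.println(UNICODE_WORD.matcher(test).matches()); // true
        System.out.println(ASCII_WORD.matcher(test).matches());   // false
    }
}
```

Note that \w also accepts digits and connector punctuation from other scripts once the flag is set, which may be broader than you want for validation.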
Upvotes: 1
Reputation: 67988
[\\p{L}\\p{M}]+
You can use this to match Unicode letters.
\p{L} matches any kind of letter from any language
\p{M} matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
See demo.
https://regex101.com/r/fM9lY3/30
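As a rough sketch of how these property classes could replace the ranges in the question's pattern, the example below adds \p{Nd} (decimal digits), _ and - to the class above. Including \p{Nd} is my assumption about what "digits" should mean here, and the class and method names are hypothetical. No flag is needed, because \p{L}, \p{M} and \p{Nd} are Unicode-aware on their own.

```java
import java.util.regex.Pattern;

class UnicodeIdentifierCheck {
    // Letters, combining marks, decimal digits, underscore and hyphen.
    private static final Pattern ALLOWED =
            Pattern.compile("^[\\p{L}\\p{M}\\p{Nd}_-]*$");

    static boolean isValid(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("熏肉еконcarácterbañlácaractères")); // true
        System.out.println(isValid("user_name-42"));                    // true
        System.out.println(isValid("e\u0301"));   // true: e + combining acute accent
        System.out.println(isValid("two words")); // false: space is not allowed
    }
}
```

Keeping \p{M} in the class matters for decomposed text, where an accent is stored as a separate combining character after its base letter.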
Upvotes: 0
Reputation: 48444
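You can use the POSIX class \\p{Alpha} instead of a literal class such as [a-zA-Z] to match Unicode and accented characters. It covers non-ASCII letters only when the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS; without the flag, \\p{Alpha} is ASCII-only.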
Example
String test = "熏肉еконcarácterbañlácaractères";
Pattern pattern = Pattern.compile("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = pattern.matcher(test);
while (m.find()) {
System.out.println(m.group());
}
Output
熏肉еконcarácterbañlácaractères
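The flag is what makes this work, which the following sketch tries to show by compiling the same expression with and without it. It also tries \p{IsAlphabetic}, the binary Unicode property that Java supports without any flag. The class and helper names are invented for this example.

```java
import java.util.regex.Pattern;

class AlphaFlagCheck {
    // Compiles the regex with the given flags and tests the whole input.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "ba\u00F1o"; // "baño", written with an escape for ñ

        // POSIX class without the flag: ASCII letters only.
        System.out.println(fullMatch("\\p{Alpha}+", 0, word)); // false

        // POSIX class with the flag: Unicode alphabetic characters.
        System.out.println(fullMatch("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS, word)); // true

        // Binary property: Unicode-aware without the flag.
        System.out.println(fullMatch("\\p{IsAlphabetic}+", 0, word)); // true
    }
}
```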
Upvotes: 4