shreekanth

Reputation: 489

Regular expression for Unicode in Java 7

I have a regular expression that validates letters, digits, _ and -. I now need to support multibyte characters as well, so I added the Unicode character class flag, but the pattern does not match. Can someone enlighten me on this?

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test123 {

    public static void main(String[] args) {

        String test = "熏肉еконcarácterbañlácaractères";
        // Expected "matched", but this prints "not matched"
        Pattern pattern = Pattern.compile("^[a-zA-Z0-9_-]*$", Pattern.UNICODE_CHARACTER_CLASS);

        Matcher matcher = pattern.matcher(test);
        if (matcher.matches()) {
            System.out.println("matched");
        } else {
            System.out.println("not matched");
        }
    }
}

Upvotes: 2

Views: 2516

Answers (3)

Pshemo

Reputation: 124275

The problem is that, even with that flag, a-z does not mean "all Unicode alphabetic characters". It still means only "characters between a and z".

The UNICODE_CHARACTER_CLASS flag adds Unicode support only to predefined and POSIX character classes, such as \w, which normally represents [a-zA-Z0-9_]. Explicit ranges are left unchanged.

So try:

Pattern.compile("^[\\w-]*$",Pattern.UNICODE_CHARACTER_CLASS);
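A minimal sketch of the difference, assuming Java 7 or later and a source file compiled as UTF-8. The class name WordClassDemo and the helper matchesAll are made up for illustration; only Pattern.compile, Pattern.UNICODE_CHARACTER_CLASS, and Matcher.matches come from the JDK. It runs the question's string against the original ASCII range and against [\w-] with and without the flag.

```java
import java.util.regex.Pattern;

public class WordClassDemo {

    // Hypothetical helper: true if the whole input matches the regex under the given flags.
    public static boolean matchesAll(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String test = "熏肉еконcarácterbañlácaractères";
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        // Explicit ASCII range: the flag has no effect on it.
        System.out.println(matchesAll("^[a-zA-Z0-9_-]*$", unicode, test)); // false

        // \w without the flag is still ASCII-only.
        System.out.println(matchesAll("^[\\w-]*$", 0, test));              // false

        // \w with the flag covers Unicode letters, marks, and digits.
        System.out.println(matchesAll("^[\\w-]*$", unicode, test));        // true

        // ASCII input from the original use case keeps working.
        System.out.println(matchesAll("^[\\w-]*$", unicode, "abc-123_"));  // true
    }
}
```

The trailing - inside the brackets is a literal hyphen, so the accepted set stays "word characters plus hyphen", as in the question.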

Upvotes: 1

vks

Reputation: 67988

[\\p{L}\\p{M}]+

You can use this to match Unicode letters, including letters written with combining marks. It needs no flag.

\p{L} matches any kind of letter from any language.
\p{M} matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes).

See the demo:

https://regex101.com/r/fM9lY3/30
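A small sketch of why \p{M} is included, assuming Java 7 or later. The class LetterMarkDemo and its helpers are hypothetical names. The last pattern, which adds \p{Nd}, _ and - to cover the character set from the question, is one possible extension and not part of the answer above. The example decomposes an accented word with java.text.Normalizer so that the accent becomes a separate combining mark.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class LetterMarkDemo {

    // Hypothetical helper: letters only, no combining marks.
    public static boolean lettersOnly(String s) {
        return Pattern.matches("\\p{L}+", s);
    }

    // Hypothetical helper: the pattern from the answer.
    public static boolean lettersAndMarks(String s) {
        return Pattern.matches("[\\p{L}\\p{M}]+", s);
    }

    // Assumption: an extension that also allows decimal digits, _ and -.
    public static boolean questionCharset(String s) {
        return Pattern.matches("[\\p{L}\\p{M}\\p{Nd}_-]*", s);
    }

    public static void main(String[] args) {
        String composed = "car\u00e1cter"; // a-acute as the single code point U+00E1
        // NFD splits it into 'a' followed by the combining acute accent U+0301.
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(lettersOnly(composed));        // true
        System.out.println(lettersOnly(decomposed));      // false: U+0301 is a mark, not a letter
        System.out.println(lettersAndMarks(decomposed));  // true
        System.out.println(lettersAndMarks("abc-123"));   // false: digits and - are not covered
        System.out.println(questionCharset("abc-123_"));  // true
    }
}
```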

Upvotes: 0

Mena

Reputation: 48444

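You can use the POSIX class \p{Alpha} (written \\p{Alpha} in a Java string literal) instead of a literal class such as [a-zA-Z] to match Unicode and accented characters. It only becomes Unicode-aware when Pattern.UNICODE_CHARACTER_CLASS is set.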

Example

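import java.util.regex.Matcher;
import java.util.regex.Pattern;

String test = "熏肉еконcarácterbañlácaractères";
Pattern pattern = Pattern.compile("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = pattern.matcher(test);
// find() returns each maximal run of alphabetic characters
while (m.find()) {
    System.out.println(m.group());
}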

Output

熏肉еконcarácterbañlácaractères
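To show how much the flag matters here, the following sketch (assuming Java 7 or later, with the made-up class AlphaFlagDemo and helpers allAlpha and allAlphaInline) checks the same word with and without UNICODE_CHARACTER_CLASS. It also uses the embedded form (?U), which the Pattern documentation lists as equivalent to passing the flag.

```java
import java.util.regex.Pattern;

public class AlphaFlagDemo {

    // Hypothetical helper: whole input must consist of \p{Alpha} characters.
    public static boolean allAlpha(String s, int flags) {
        return Pattern.compile("\\p{Alpha}+", flags).matcher(s).matches();
    }

    // Same check, enabling the flag inside the pattern with (?U).
    public static boolean allAlphaInline(String s) {
        return Pattern.matches("(?U)\\p{Alpha}+", s);
    }

    public static void main(String[] args) {
        String word = "caract\u00e8res"; // "caractères" with a precomposed e-grave
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        System.out.println(allAlpha(word, 0));          // false: \p{Alpha} is ASCII-only by default
        System.out.println(allAlpha(word, unicode));    // true
        System.out.println(allAlphaInline(word));       // true
        System.out.println(allAlpha("abc123", unicode)); // false: digits are not alphabetic
    }
}
```

Because \p{Alpha} covers letters only, a pattern that must also accept the digits, _ and - from the question would need those added to the class, for example [\p{Alpha}\p{Digit}_-] under the same flag.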

Upvotes: 4
