IddoG

Reputation: 291

Java: how to check if character belongs to a specific unicode block?

I need to identify what natural language my input belongs to. The goal is to distinguish between Arabic and English words in a mixed input, where the input is Unicode and is extracted from XML text nodes. I have noticed the class Character.UnicodeBlock. Is it related to my problem? How can I get it to work?

Edit: The Character.UnicodeBlock approach was useful for Arabic, but apparently doesn't work for English (or other European languages), because the BASIC_LATIN Unicode block covers symbols and non-printable characters as well as letters. So now I am using the matches() method of String with the regex "[A-Za-z]+" instead. I can live with it, but perhaps someone can suggest a nicer or faster way.

Upvotes: 18

Views: 13339

Answers (5)

nwellnhof

Reputation: 33638

The Unicode Script property is probably more useful. In Java, it can be looked up using the java.lang.Character.UnicodeScript class:

Character.UnicodeScript script = Character.UnicodeScript.of(c);
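
As a rough sketch of how that lookup could separate Arabic words from English ones, one option is to count letters per script and report the dominant one. The class and method names below are hypothetical, not part of the JDK, and the tie-breaking rule is an arbitrary assumption. The loop walks code points rather than chars so that supplementary characters are handled too:

```java
import java.lang.Character.UnicodeScript;

class ScriptGuess {
    // Hypothetical helper: returns ARABIC or LATIN depending on which script
    // has more letters in the word, or COMMON if the word has letters of neither.
    // Ties go to LATIN (an arbitrary choice for this sketch).
    static UnicodeScript dominantScript(String word) {
        int arabic = 0;
        int latin = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace
            }
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.ARABIC) {
                arabic++;
            } else if (script == UnicodeScript.LATIN) {
                latin++;
            }
        }
        if (arabic == 0 && latin == 0) {
            return UnicodeScript.COMMON;
        }
        return arabic > latin ? UnicodeScript.ARABIC : UnicodeScript.LATIN;
    }
}
```

Because the check is restricted to letters, the symbols and control characters that make BASIC_LATIN awkward do not count toward either script.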

Upvotes: 3

james.garriss

Reputation: 13397

English characters tend to be in these 4 Unicode blocks:

// Unicode blocks that cover the characters typically found in English text
ArrayList<Character.UnicodeBlock> english = new ArrayList<>();
english.add(Character.UnicodeBlock.BASIC_LATIN);
english.add(Character.UnicodeBlock.LATIN_1_SUPPLEMENT);
english.add(Character.UnicodeBlock.LATIN_EXTENDED_A);
english.add(Character.UnicodeBlock.GENERAL_PUNCTUATION);

So if you have a String, you can loop over all the characters and see what Unicode block each character is in:

for (char currentChar : myString.toCharArray())  
{
    Character.UnicodeBlock unicodeBlock = Character.UnicodeBlock.of(currentChar);
    if (english.contains(unicodeBlock))
    {
        // This character is English
    }
}

If every character falls in one of these blocks, then you know the string contains only characters used in English. You could repeat this for any language; you'll just have to figure out which Unicode blocks each language uses.
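
The whole-string check could be wrapped up roughly like this. It is only a sketch: the class and method names are hypothetical, Set.of assumes Java 9 or later, and the null guard is there because Character.UnicodeBlock.of returns null for code points that belong to no block:

```java
import java.util.Set;

class BlockCheck {
    // The same four blocks as in the list above.
    static final Set<Character.UnicodeBlock> ENGLISH = Set.of(
            Character.UnicodeBlock.BASIC_LATIN,
            Character.UnicodeBlock.LATIN_1_SUPPLEMENT,
            Character.UnicodeBlock.LATIN_EXTENDED_A,
            Character.UnicodeBlock.GENERAL_PUNCTUATION);

    // Hypothetical helper: true only if every code point of the text falls
    // in one of the allowed blocks. An empty string yields true.
    static boolean usesOnly(String text, Set<Character.UnicodeBlock> allowed) {
        return text.codePoints().allMatch(cp -> {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            return block != null && allowed.contains(block);
        });
    }
}
```

A call such as BlockCheck.usesOnly(myString, BlockCheck.ENGLISH) then answers the question for the whole string at once.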

Note: This does NOT mean that you've proven the language is English. You've only proven it uses characters found in English. It could be French, German, Spanish, or other languages whose characters have a lot of overlap with English.

There are other ways to detect the actual natural language. Libraries like langdetect, which I have used with great success, can do this for you:

https://code.google.com/p/language-detection/

Upvotes: 1

Dennis C

Reputation: 24747

Yes, it is. You can simply call Character.UnicodeBlock.of(char) to find out which block a character belongs to.
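
For the Arabic side of the question, a minimal sketch might look like this. The class and method names are hypothetical, and it tests only the main ARABIC block; depending on the input, blocks such as ARABIC_PRESENTATION_FORMS_A and ARABIC_PRESENTATION_FORMS_B may need to be included as well:

```java
class ArabicBlockCheck {
    // Hypothetical helper: true if the char lies in the main Arabic block.
    static boolean isInArabicBlock(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC;
    }
}
```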

Upvotes: 21

Alan Moore

Reputation: 75242

If [A-Za-z]+ meets your requirement, you aren't going to find anything faster or prettier. However, if you want to match all letters in the Latin1 block (including accented letters and ligatures), you can use this:

Pattern p = Pattern.compile("[\\pL&&\\p{L1}]+");

That's the intersection of the set of all Unicode letters and the set of all Latin1 characters.
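
To illustrate the intersection idea, here is a small sketch that spells out the Latin-1 range explicitly instead of relying on the L1 class name. That substitution is an assumption made for this example, and the class and method names are hypothetical:

```java
import java.util.regex.Pattern;

class Latin1Letters {
    // All Unicode letters intersected with the range U+0000..U+00FF,
    // written out explicitly as a stand-in for the L1 class used above.
    static final Pattern LATIN1_WORD =
            Pattern.compile("[\\p{L}&&[\\u0000-\\u00FF]]+");

    // Hypothetical helper: true if the whole string consists of Latin-1 letters.
    static boolean isLatin1Word(String s) {
        return LATIN1_WORD.matcher(s).matches();
    }
}
```

Compiling the pattern once and reusing it avoids the recompilation that String.matches() performs on every call.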

Upvotes: 4

Fernando Miguélez

Reputation: 11316

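You have the opposite problem to this one, but ironically what doesn't work for him should work well for you: just look for English words (ASCII-compatible characters only) with the regular expression "\w". Note that in Java "\w" by default matches only [a-zA-Z_0-9], so digits and underscores are included along with the letters.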

Upvotes: 0
