Reputation: 291
I need to identify what natural language my input belongs to.
The goal is to distinguish between Arabic and English words in a mixed input, where the input is Unicode and is extracted from XML text nodes.
I have noticed the class Character.UnicodeBlock. Is it related to my problem? How can I get it to work?
Edit:
The Character.UnicodeBlock approach was useful for Arabic, but apparently doesn't do it for English (or other European languages) because the BASIC_LATIN Unicode block covers symbols and non-printable characters as well as letters.
So now I am using the matches() method of the String object with the regex "[A-Za-z]+" instead. I can live with it, but perhaps someone can suggest a nicer/faster way.
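For readers who want to see the combined approach from the edit in one place, here is a minimal sketch. The class and method names (WordClassifier, isAsciiWord, isArabicWord) are made up for illustration and the whitespace split is only an assumption about how words are separated. The pattern is compiled once because String.matches() recompiles the regex on every call. The Arabic sample is written with \u escapes so the source stays ASCII.

```java
import java.util.regex.Pattern;

class WordClassifier {
    // Compiled once; String.matches() would recompile the regex on each call.
    private static final Pattern ASCII_LETTERS = Pattern.compile("[A-Za-z]+");

    // True if the word consists only of the letters A-Z / a-z.
    static boolean isAsciiWord(String word) {
        return ASCII_LETTERS.matcher(word).matches();
    }

    // True if every code point of a non-empty word is in the ARABIC block.
    static boolean isArabicWord(String word) {
        if (word.isEmpty()) {
            return false;
        }
        return word.codePoints()
                   .allMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.ARABIC);
    }

    public static void main(String[] args) {
        // Hypothetical mixed input; the middle word is Arabic, written as escapes.
        String text = "hello \u0645\u0631\u062D\u0628\u0627 world";
        for (String word : text.split("\\s+")) {
            String kind = isAsciiWord(word) ? "English-like"
                        : isArabicWord(word) ? "Arabic"
                        : "other";
            System.out.println(word + " -> " + kind);
        }
    }
}
```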
Upvotes: 18
Views: 13339
Reputation: 33638
The Unicode Script property is probably more useful. In Java, it can be looked up using the java.lang.Character.UnicodeScript class:
Character.UnicodeScript script = Character.UnicodeScript.of(c);
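As a rough sketch of how that lookup could be used per word: the helper below (ScriptDemo and scriptOf are hypothetical names, not part of the JDK) returns the script of the first letter it finds, so digits and punctuation, which usually report COMMON, don't get in the way. It assumes a word is written in a single script.

```java
class ScriptDemo {
    // Script of the first letter in the word, or UNKNOWN if it has no letters.
    static Character.UnicodeScript scriptOf(String word) {
        return word.codePoints()
                   .filter(Character::isLetter)           // skip digits, punctuation, spaces
                   .mapToObj(Character.UnicodeScript::of) // look up the script per code point
                   .findFirst()
                   .orElse(Character.UnicodeScript.UNKNOWN);
    }

    public static void main(String[] args) {
        // The second sample is an Arabic word written with \u escapes.
        String[] samples = { "hello", "\u0645\u0631\u062D\u0628\u0627", "123" };
        for (String s : samples) {
            System.out.println(s + " -> " + scriptOf(s));
        }
    }
}
```

Comparing the result against Character.UnicodeScript.LATIN or Character.UnicodeScript.ARABIC then separates the two kinds of words.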
Upvotes: 3
Reputation: 13397
English characters tend to be in these 4 Unicode blocks:
ArrayList<Character.UnicodeBlock> english = new ArrayList<>();
english.add(Character.UnicodeBlock.BASIC_LATIN);
english.add(Character.UnicodeBlock.LATIN_1_SUPPLEMENT);
english.add(Character.UnicodeBlock.LATIN_EXTENDED_A);
english.add(Character.UnicodeBlock.GENERAL_PUNCTUATION);
So if you have a String, you can loop over all the characters and see what Unicode block each character is in:
for (char currentChar : myString.toCharArray())
{
    Character.UnicodeBlock unicodeBlock = Character.UnicodeBlock.of(currentChar);
    if (english.contains(unicodeBlock))
    {
        // This character is English
    }
}
If every character falls in one of those blocks, then you know the string contains only characters used in English. You could repeat this for any language; you'll just have to figure out which Unicode blocks each language uses.
Note: This does NOT mean that you've proven the language is English. You've only proven it uses characters found in English. It could be French, German, Spanish, or other languages whose characters have a lot of overlap with English.
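To make the "are they all in these blocks" check reusable, something like the following could work. This is only a sketch: BlockCheck, LATIN_BLOCKS and usesOnly are names I made up, and it walks code points rather than chars so characters outside the BMP are not split.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class BlockCheck {
    // The four blocks listed above.
    static final Set<Character.UnicodeBlock> LATIN_BLOCKS = new HashSet<>(Arrays.asList(
            Character.UnicodeBlock.BASIC_LATIN,
            Character.UnicodeBlock.LATIN_1_SUPPLEMENT,
            Character.UnicodeBlock.LATIN_EXTENDED_A,
            Character.UnicodeBlock.GENERAL_PUNCTUATION));

    // True if every code point of the text belongs to one of the given blocks.
    static boolean usesOnly(String text, Set<Character.UnicodeBlock> blocks) {
        return text.codePoints()
                   .allMatch(cp -> blocks.contains(Character.UnicodeBlock.of(cp)));
    }

    public static void main(String[] args) {
        System.out.println(usesOnly("na\u00EFve caf\u00E9", LATIN_BLOCKS));
        System.out.println(usesOnly("\u0645\u0631\u062D\u0628\u0627", LATIN_BLOCKS));
    }
}
```

Passing a different set (for example one containing Character.UnicodeBlock.ARABIC) gives the same check for another language's characters.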
There are other ways to detect the actual natural language. Libraries like langdetect, which I have used with great success, can do this for you:
https://code.google.com/p/language-detection/
Upvotes: 1
Reputation: 75242
If [A-Za-z]+ meets your requirement, you aren't going to find anything faster or prettier. However, if you want to match all letters in the Latin1 block (including accented letters and ligatures), you can use this:
Pattern p = Pattern.compile("[\\pL&&\\p{L1}]+");
That's the intersection of the set of all Unicode letters and the set of all Latin1 characters.
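A short sketch of using that pattern to pull the Latin-1 letter runs out of mixed text. Latin1Words and latin1Runs are illustrative names only, and the non-ASCII characters in the sample are written as \u escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Latin1Words {
    // Letters (\pL) intersected with the Latin-1 range (\p{L1}).
    private static final Pattern LATIN1_LETTERS = Pattern.compile("[\\pL&&\\p{L1}]+");

    // Collects every maximal run of Latin-1 letters found in the text.
    static List<String> latin1Runs(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = LATIN1_LETTERS.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Mixed sample: an accented Latin word, two Arabic letters, a digit and a letter.
        System.out.println(latin1Runs("caf\u00E9 \u0645\u0631 2x"));
    }
}
```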
Upvotes: 4
Reputation: 11316
You have the opposite problem to this one, but ironically what doesn't work for him should work well for you: just look for English words (ASCII-compatible characters only) with the regex "\w". Note that in Java, \w by default also matches digits and the underscore, not just letters.
Upvotes: 0