Reputation: 1370
I have several UTF-8 strings and need to determine the language based on the characters used. It is not important to distinguish between languages that use the Latin alphabet, such as German, Dutch and English. The languages that occur are Arabic, Korean, Chinese and Japanese, i.e. languages with a distinctive character set. The strings themselves are names in most cases, and it can be assumed that the first character is enough for recognition.
Upvotes: 1
Views: 905
Reputation: 9402
The easiest way is probably to use the icu4j library and the method UScript.getScript(int)
It detects the script on a per-character basis. For punctuation and spacing, it returns UScript.COMMON. For Latin, it returns UScript.LATIN. For Chinese characters and Japanese kanji, it returns UScript.HAN. For Japanese kana, it returns UScript.KATAKANA or UScript.HIRAGANA (so a single HAN doesn't prove that the text is Chinese rather than Japanese).
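As a rough illustration of the HAN ambiguity, here is a minimal sketch. It uses the JDK's built-in Character.UnicodeScript (available since Java 7) rather than ICU4J's UScript, so that it runs without extra dependencies; the two are assumed to classify the characters discussed here the same way. The class name ScriptLanguageGuess and the returned labels are hypothetical.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper: maps the scripts found in a name to a coarse language label.
final class ScriptLanguageGuess {
    static String guess(String name) {
        boolean sawHan = false;
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            // Kana only occurs in Japanese, so a single kana character settles it.
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                return "japanese";
            }
            if (script == UnicodeScript.HANGUL) {
                return "korean";
            }
            if (script == UnicodeScript.ARABIC) {
                return "arabic";
            }
            if (script == UnicodeScript.HAN) {
                sawHan = true; // Could be Chinese or Japanese; keep scanning for kana.
            }
            i += Character.charCount(cp);
        }
        // HAN without any kana stays ambiguous; everything else is lumped together.
        return sawHan ? "chinese-or-japanese" : "other";
    }
}
```

A name written only in kanji therefore comes back as "chinese-or-japanese"; separating the two would need information beyond the script, which this sketch does not attempt.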
It's recommended that you iterate over the code points of your string, but in most cases iterating over chars is enough.
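To show where the two kinds of iteration differ, the sketch below again substitutes the JDK's Character.UnicodeScript for ICU4J's UScript (an assumption made only to keep the example dependency-free). The class and method names are hypothetical. A character outside the Basic Multilingual Plane, such as the CJK ideograph U+20000, is stored as two chars in a Java String, so only code point iteration sees it as a single HAN character.

```java
import java.lang.Character.UnicodeScript;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper contrasting per-code-point and per-char script lookup.
final class CodePointScripts {
    // One script per Unicode code point (surrogate pairs are combined first).
    static List<UnicodeScript> perCodePoint(String s) {
        return s.codePoints()
                .mapToObj(UnicodeScript::of)
                .collect(Collectors.toList());
    }

    // One script per UTF-16 char; a lone surrogate is not classified as HAN.
    static List<UnicodeScript> perChar(String s) {
        return s.chars()
                .mapToObj(UnicodeScript::of)
                .collect(Collectors.toList());
    }
}
```

For text that stays within the Basic Multilingual Plane, which covers most names in Arabic, Korean, Chinese and Japanese, both methods return the same list.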
Here's some more theory: https://en.wikipedia.org/wiki/Script_%28Unicode%29
And here's the table with scripts defined for all the characters: http://www.unicode.org/Public/UNIDATA/Scripts.txt
Upvotes: 2
Reputation: 27115
One way to do it would be, for each language, keep a list of ordered pairs (c, f) where c is a unique character from the language, and f is the frequency of occurrence of that character in some reasonable corpus from that language. (Call those lists, "character histograms".)
Then, for each document, compute a character histogram from the document, and compare it to all of the known languages. Go with whatever is the closest match.
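The histogram comparison could be sketched roughly as follows. The answer does not say how "closest" is measured; cosine similarity is used here purely as an assumption, and the class and method names are hypothetical. The reference profiles would have to be built from real corpora, which this sketch does not supply.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: relative character frequencies and a similarity measure.
final class CharHistogram {
    // Maps each code point in the text to its relative frequency.
    static Map<Integer, Double> of(String text) {
        Map<Integer, Double> histogram = new HashMap<>();
        int[] codePoints = text.codePoints().toArray();
        for (int cp : codePoints) {
            histogram.merge(cp, 1.0 / codePoints.length, Double::sum);
        }
        return histogram;
    }

    // Cosine similarity between two histograms (assumed metric); 0 if either is empty.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the name of the profile most similar to the document, or null if none.
    static String closest(String document, Map<String, Map<Integer, Double>> profiles) {
        Map<Integer, Double> histogram = of(document);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<Integer, Double>> p : profiles.entrySet()) {
            double score = cosine(histogram, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best;
    }
}
```

Short strings such as names give very sparse histograms, so this approach is likely to be more reliable on longer documents.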
A better way would be to compare word histograms.
A practical way would be... I don't know.
Upvotes: 0
Reputation: 7257
In theory, a Java String (UTF-16) can contain both German and Chinese text.
You could probably maintain a list of frequently occurring Chinese characters and, if any of them appear in the string, assume that it's Chinese, and so on.
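For strings that mix scripts, one possible variation is to count characters per script instead of keeping a hand-made character list. This is a swapped-in technique, not what the answer describes: it relies on the JDK's Character.UnicodeScript, and the class name ScriptCounts is hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: counts how many code points of each script a string contains.
final class ScriptCounts {
    static Map<UnicodeScript, Integer> count(String s) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        // Iterate over code points so supplementary characters are counted once.
        s.codePoints().forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }
}
```

The caller can then decide what a mixed result means, for example treating any HAN count above zero as a sign that Chinese or Japanese text is present.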
Upvotes: 0