Reputation: 1370

Get language from UTF8

I have several UTF-8 strings and need to determine the language from the characters used. It is not important to distinguish between languages that use the Latin alphabet, such as German, Dutch and English. The languages that occur are Arabic, Korean, Chinese and Japanese, i.e. languages with a distinct character set. The strings themselves are names in most cases, and it can be assumed that the first character is enough for recognition.

Upvotes: 1

Answers (3)

Karol S

Reputation: 9402

The easiest way is probably to use the icu4j library and its method UScript.getScript(int).

It detects the script on a per-character basis. For punctuation and spacing, it returns UScript.COMMON. For Latin, it returns UScript.LATIN. For Chinese characters and Japanese kanji, it returns UScript.HAN. For Japanese kana, it returns UScript.KATAKANA or UScript.HIRAGANA (so a single HAN character doesn't prove that the text is Chinese rather than Japanese).

It's recommended that you iterate over codepoints of your string, but in most cases iterating over chars is enough.
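If pulling in icu4j is not an option, the same idea can be sketched with the JDK's built-in Character.UnicodeScript (Java 7+). Note that this is a substitute for UScript, not the icu4j API itself. The class name ScriptGuess and the method firstScript are made up for this example, and skipping COMMON, INHERITED and UNKNOWN code points is an assumption about what "first character" should mean for names.

```java
class ScriptGuess {
    // Returns the script of the first code point that carries script
    // information, skipping punctuation, spaces and combining marks.
    // Falls back to COMMON if the string contains no such code point.
    static Character.UnicodeScript firstScript(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // handles surrogate pairs
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED
                    && script != Character.UnicodeScript.UNKNOWN) {
                return script;
            }
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return Character.UnicodeScript.COMMON;
    }

    public static void main(String[] args) {
        // Sample inputs written as Unicode escapes to avoid source-encoding issues.
        String[] samples = {
            "\uAE40\uCCA0\uC218",   // a Korean name
            "\u0645\u062D\u0645\u062F", // an Arabic name
            "\u738B\u5C0F\u660E",   // a name in Han characters
            "\u30A2\u30AD\u30E9",   // a name in katakana
            "Anna"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + firstScript(sample));
        }
    }
}
```

Korean names should come back as HANGUL and Arabic ones as ARABIC. A HAN result still leaves Chinese vs. Japanese open, so for that case you may want to scan the rest of the string for HIRAGANA or KATAKANA before deciding.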

Here's some more theory: https://en.wikipedia.org/wiki/Script_%28Unicode%29

And here's the table with scripts defined for all the characters: http://www.unicode.org/Public/UNIDATA/Scripts.txt

Upvotes: 2

Solomon Slow

Reputation: 27115

One way to do it would be, for each language, to keep a list of ordered pairs (c, f), where c is a distinct character from the language and f is the frequency of occurrence of that character in some reasonable corpus from that language. (Call those lists "character histograms".)

Then, for each document, compute a character histogram from the document, and compare it to all of the known languages. Go with whatever is the closest match.
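A minimal sketch of that comparison, under some assumptions of mine: "closest" is taken to mean highest cosine similarity between relative-frequency histograms, only letters are counted, and case is folded. The class CharHistogram and its method names are hypothetical, and the reference strings in main are placeholders; a real reference corpus would have to be much larger.

```java
import java.util.HashMap;
import java.util.Map;

class CharHistogram {
    // Builds a map from code point to relative frequency, counting letters only.
    static Map<Integer, Double> histogram(String text) {
        Map<Integer, Double> freq = new HashMap<>();
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;                       // ignore digits, spaces, punctuation
            }
            freq.merge(Character.toLowerCase(cp), 1.0, Double::sum);
            total++;
        }
        final int n = total;
        if (n > 0) {
            freq.replaceAll((key, count) -> count / n); // counts -> frequencies
        }
        return freq;
    }

    // Cosine similarity of two histograms; 0 if either one is empty.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the label of the reference histogram most similar to the document.
    static String closest(String document, Map<String, Map<Integer, Double>> references) {
        Map<Integer, Double> doc = histogram(document);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<Integer, Double>> e : references.entrySet()) {
            double score = cosine(doc, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder "corpora"; far too small to be meaningful in practice.
        Map<String, Map<Integer, Double>> references = new HashMap<>();
        references.put("english", histogram("the quick brown fox jumps over the lazy dog"));
        references.put("german", histogram("der schnelle braune fuchs springt \u00FCber den faulen hund"));
        System.out.println(closest("a lazy dog sleeps in the sun", references));
    }
}
```

For short strings such as names, the histograms are very sparse, so the result is only as good as the reference corpora and may be unreliable.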

A better way would be to compare word histograms.

A practical way would be... I don't know.

Upvotes: 0

Kalpesh Soni

Reputation: 7257

In theory you can have a String (UTF-16) in Java that contains both German and Chinese.

You could probably maintain a list of frequently occurring Chinese characters and, if any of them appear in the string, assume that it's Chinese, and so on for the other languages.

Upvotes: 0
