Reputation: 25

unicode characters

In my application I have Unicode strings, and I need to tell which language each string is written in. I want to do this by narrowing down the list of possible languages based on which ranges the string's characters fall into.

I took the ranges from http://jrgraphix.net/research/unicode_blocks.php

and the possible languages from http://unicode-table.com/en/

The problem is that the algorithm has to detect all languages. Does anyone know of a more complete mapping of Unicode ranges to languages?

Thanks, Wojciech

Upvotes: 0

Answers (2)

Sebastian Negraszus

Reputation: 12215

This is not really possible, for a couple of reasons:

Many languages share the same writing system. Look at English and Dutch, for example. Both use the Basic Latin alphabet. By only looking at the range of code points, you simply cannot distinguish between them.
Some languages use more characters, but there is no guarantee that a specific piece of text contains them. German, for example, uses the Basic Latin alphabet plus "ä", "ö", "ü" and "ß". While these letters are not particularly rare, you can easily create whole sentences without them. So, a short text might not contain them. Thus, again, looking at code points alone is not enough.
Text is not always "pure". An English text may contain French letters because of a French loanword (e.g. "déjà vu"). Or it may contain foreign words, because the text is talking about foreign things (e.g. "Götterdämmerung is an opera by Richard Wagner...", or "The Great Wall of China (万里长城) is..."). Looking at code points alone would be misleading.

To sum up, no, you cannot reliably map code point ranges to languages.
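To make the points above concrete, here is a minimal sketch that reports which Unicode blocks the letters of a string fall into. The `BLOCKS` table is a small hand-picked subset and `blocks_used` is a hypothetical helper name, not part of any library. English and Dutch come out identical, and a mostly English sentence also reports a CJK block.

```python
# Hand-picked subset of Unicode blocks (start, end inclusive); not a complete table.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0F00, 0x0FFF, "Tibetan"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def blocks_used(text):
    """Return the set of block names covering the letters in `text`."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        cp = ord(ch)
        for start, end, name in BLOCKS:
            if start <= cp <= end:
                found.add(name)
                break
        else:
            found.add("Other")  # letter outside the subset listed above
    return found


english = blocks_used("The quick brown fox")
dutch = blocks_used("De snelle bruine vos")
mixed = blocks_used("The Great Wall of China (万里长城)")
```

Here `english` and `dutch` are the same set, so the block information alone cannot separate them, and `mixed` contains a CJK block even though the sentence is English.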

What you could do: Count how often each character appears in the text and heuristically compare with statistics about known languages. Or analyse word structures, e.g. with Markov chains. Or search for the words in dictionaries (taking inflection, composition etc. into account). Or a combination of these.
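As a rough sketch of the character-counting idea, assuming cosine similarity as the comparison measure (my choice for illustration, not something prescribed above): build a letter-frequency profile per language and rank the languages by similarity to the input. The two `SAMPLES` sentences are toy stand-ins; real statistics would have to come from large corpora, and `profile`, `cosine` and `guess_language` are hypothetical names.

```python
import math
from collections import Counter

# Toy reference texts; real profiles would be built from large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs into the forest",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft dann in den wald",
}


def profile(text):
    """Relative frequency of each letter in `text` (case-insensitive)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: n / total for ch, n in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(v * q.get(ch, 0.0) for ch, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return (language, similarity) pairs, best match first."""
    p = profile(text)
    scores = [(lang, cosine(p, ref)) for lang, ref in PROFILES.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Calling `guess_language(some_text)` returns a ranked list, not a single answer, which fits the heuristic nature of the approach: for short inputs the top scores will often be close together.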

But this is hard and a lot of work. You should rather use an existing solution, such as those recommended by deceze and Esailija.

Upvotes: 2

ndp

Reputation: 22006

I like the suggestion of using something like Google Translate, since it will do all the work for you.

You might be able to build a rule-based system that gets you part of the way there. Build heuristic rules for languages and see if that is sufficient. Certain Tibetan characters do indicate Tibetan, and many languages have unique characters that will be a giveaway. But as the other answer pointed out, a short sample of text may not be classified accurately, as it may not contain a clear indicator.
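A rule-based starting point could look something like the sketch below. The rules and labels are illustrative assumptions on my part (for example, "ñ" also occurs outside Spanish, and the Tibetan block is used for more than one language), and `RULES` and `hints` are hypothetical names.

```python
# Illustrative rules only: each pairs a test on a single character with a hint.
RULES = [
    (lambda ch: 0x0F00 <= ord(ch) <= 0x0FFF, "Tibetan script"),
    (lambda ch: ch in "ßẞ", "probably German"),
    (lambda ch: ch in "ñÑ", "possibly Spanish"),
]


def hints(text):
    """Return the hints whose rule fires for at least one character, in rule order."""
    return [label for test, label in RULES if any(test(ch) for ch in text)]
```

For a text with no indicator characters, `hints` returns an empty list, which is exactly the limitation described above.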

Languages do, however, differ in how frequently each character appears, so you could build a basic fingerprint of each language you need to classify and make guesses based on letter frequency. This probably goes a bit further than a rule-based system. A good tool for this would probably be a text classification algorithm, which does the analysis for you: you train it on samples of the different languages instead of having to articulate the actual rules yourself.
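To illustrate the training idea, here is a minimal sketch of one possible classifier: multinomial naive Bayes over character bigrams with add-one smoothing and a uniform prior over languages. That particular algorithm is my assumption for the example, the class and function names are hypothetical, and the two training sentences are far too small for real use.

```python
import math
from collections import Counter, defaultdict


def bigrams(text):
    """Character bigrams of the lowercased text, spaces included."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramClassifier:
    """Multinomial naive Bayes over character bigrams with add-one smoothing."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> bigram counts
        self.vocab = set()                  # all bigrams seen in training

    def train(self, lang, text):
        grams = bigrams(text)
        self.counts[lang].update(grams)
        self.vocab.update(grams)

    def score(self, lang, text):
        """Log-likelihood of `text` under the model for `lang`."""
        counts = self.counts[lang]
        # Add-one smoothing; the extra +1 reserves mass for unseen bigrams.
        total = sum(counts.values()) + len(self.vocab) + 1
        return sum(math.log((counts[g] + 1) / total) for g in bigrams(text))

    def classify(self, text):
        """Return the trained language with the highest log-likelihood."""
        return max(self.counts, key=lambda lang: self.score(lang, text))


clf = BigramClassifier()
clf.train("en", "the quick brown fox jumps over the lazy dog")
clf.train("de", "der schnelle braune fuchs springt über den faulen hund")
```

After training, `clf.classify(text)` returns the label whose bigram statistics fit the text best. With realistic amounts of training text per language, the same structure extends to as many languages as you have samples for.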

A much more sophisticated version of this is presumably what Google does.

Upvotes: 0
