Reputation: 3118
I have been wondering for some time how Google Translate (or a hypothetical translator) detects the language of the string entered in the "from" field. I have been thinking about this, and the only thing I can come up with is looking for words in the input string that are unique to a language. Another way could be to check sentence formation or other semantics in addition to keywords, but this seems to be a very difficult task considering the number of languages and their semantics. I did some research and found that there are approaches that use n-gram sequences and statistical models to detect the language. I would appreciate a high-level answer too.
Upvotes: 18
Views: 11704
Reputation: 136197
You might be interested in The WiLI benchmark dataset for written language identification. A high-level answer can also be found in the paper.
Upvotes: 4
Reputation: 563
Take Wikipedia in English. Check the probability that the letter 'b' follows the letter 'a' (for example) and do that for every combination of letters; you will end up with a matrix of probabilities.
If you do the same for Wikipedia in other languages, you will get a different matrix for each language.
To detect the language, just use all those matrices and treat the probabilities as a score. Say that in English you'd get these probabilities:
t->h = 0.3, h->e = 0.2
and in the Spanish matrix you'd get
t->h = 0.01, h->e = 0.3
The word 'the', using the English matrix, would give you a score of 0.3 + 0.2 = 0.5, and using the Spanish one: 0.01 + 0.3 = 0.31.
The English matrix wins, so the text has to be English.
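The toy calculation above can be sketched in a few lines of Python. The two-entry "matrices" below hold only the example numbers from this answer, not real corpus statistics, and a real implementation would multiply the probabilities (or sum their logs) rather than add them; the sum just keeps the example's arithmetic:

```python
# Per-language bigram "matrices" holding only this answer's example
# numbers (real ones would be estimated from Wikipedia text).
BIGRAM_PROBS = {
    "English": {("t", "h"): 0.3, ("h", "e"): 0.2},
    "Spanish": {("t", "h"): 0.01, ("h", "e"): 0.3},
}

def score(text, probs):
    # Sum the probability of each adjacent letter pair, as in the answer.
    return sum(probs.get(pair, 0.0) for pair in zip(text, text[1:]))

def guess_language(text):
    return max(BIGRAM_PROBS, key=lambda lang: score(text, BIGRAM_PROBS[lang]))

print(guess_language("the"))  # English scores 0.5 vs Spanish's 0.31
```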
Upvotes: 16
Reputation: 61016
An implementation example.
Mathematica is a good fit for implementing this. It recognizes (i.e. has dictionaries for) words in the following languages:
dicts = DictionaryLookup[All]
{"Arabic", "BrazilianPortuguese", "Breton", "BritishEnglish", \
"Catalan", "Croatian", "Danish", "Dutch", "English", "Esperanto", \
"Faroese", "Finnish", "French", "Galician", "German", "Hebrew", \
"Hindi", "Hungarian", "IrishGaelic", "Italian", "Latin", "Polish", \
"Portuguese", "Russian", "ScottishGaelic", "Spanish", "Swedish"}
I built a small and naive function to calculate the probability of a sentence belonging to each of those languages:
f[text_] :=
 SortBy[
  {#[[1]], #[[2]]/Length@k} & /@
   Tally[First /@
     Flatten[DictionaryLookup[{All, #}] & /@ (k = StringSplit[text]), 1]],
  -#[[2]] &]
So just by looking up words in dictionaries you can get a good approximation, even for short sentences:
f["we the people"]
{{BritishEnglish,1},{English,1},{Polish,2/3},{Dutch,1/3},{Latin,1/3}}
f["sino yo triste y cuitado que vivo en esta prisión"]
{{Spanish,1},{Portuguese,7/10},{Galician,3/5},... }
f["wszyscy ludzie rodzą się wolni"]
{{"Polish", 3/5}}
f["deutsch lernen mit jetzt"]
{{"German", 1}, {"Croatian", 1/4}, {"Danish", 1/4}, ...}
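For comparison, the same dictionary-membership idea can be sketched in Python. The tiny word sets below are hand-made stand-ins for real dictionaries, so the scores are purely illustrative:

```python
# Toy stand-ins for real per-language dictionaries.
DICTS = {
    "English": {"we", "the", "people", "free"},
    "Spanish": {"que", "en", "esta", "yo", "y", "vivo"},
}

def scores(sentence):
    # Fraction of the sentence's words found in each language's dictionary.
    words = sentence.lower().split()
    return {lang: sum(w in vocab for w in words) / len(words)
            for lang, vocab in DICTS.items()}

print(scores("we the people"))  # English: 1.0, Spanish: 0.0
```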
Upvotes: 5
Reputation: 3607
If you want to implement a lightweight language guesser in the programming language of your choice, you can use the method of Cavnar and Trenkle '94, 'N-Gram-Based Text Categorization'. You can find the paper on Google Scholar, and it is pretty straightforward.
Their method builds an n-gram statistic, from some text in each language, for every language it should later be able to guess. The same statistic is then built for the unknown text as well and compared to the previously trained statistics with a simple out-of-place measure. If you use unigrams + bigrams (possibly + trigrams) and compare the 100-200 most frequent n-grams, your hit rate should be over 95% if the text to guess is not too short. There was a demo available here, but it doesn't seem to work at the moment.
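A minimal Python sketch of that out-of-place comparison, assuming made-up one-sentence "corpora" (real profiles would be trained on far more text) and the 200-n-gram profile size mentioned above:

```python
from collections import Counter

def ngram_ranks(text, max_n=2, top=200):
    # Frequency-ranked profile of character unigrams and bigrams.
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}

def out_of_place(train_ranks, test_ranks, max_penalty=200):
    # Sum of rank differences; n-grams missing from the trained profile
    # get the maximum penalty, as in Cavnar and Trenkle's measure.
    return sum(abs(rank - train_ranks[g]) if g in train_ranks else max_penalty
               for g, rank in test_ranks.items())

# Made-up toy training text, one string per language.
TRAIN = {
    "English": "the quick brown fox jumps over the lazy dog and the cat",
    "German": "der schnelle braune fuchs springt ueber den faulen hund",
}
PROFILES = {lang: ngram_ranks(text) for lang, text in TRAIN.items()}

def guess(text):
    test = ngram_ranks(text)
    return min(PROFILES, key=lambda lang: out_of_place(PROFILES[lang], test))
```

A profile compared against itself has distance 0, so at minimum each training text is classified as its own language.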
There are other approaches to language guessing, including computing the probabilities of n-grams and using more advanced classifiers, but in most cases the approach of Cavnar and Trenkle should perform well enough.
Upvotes: 12
Reputation: 62048
You don't have to do a deep analysis of the text to get an idea of what language it's in. Statistics tells us that every language has characteristic character patterns and frequencies. That's a pretty good first-order approximation. It gets worse when the text is in multiple languages, but even then it's not extremely complex. Of course, if the text is too short (e.g. a single word, or worse, a single short word), statistics doesn't work, and you need a dictionary.
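As a rough sketch of that first-order approximation: compare the text's letter distribution against per-language reference frequencies. The reference numbers below are approximate textbook values for a handful of letters only, not full tables:

```python
from collections import Counter

# Approximate relative frequencies for a few common letters
# (rough illustrative values; a real detector would use full tables).
REF = {
    "English": {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075},
    "German":  {"e": 0.174, "n": 0.098, "i": 0.076, "s": 0.073},
}

def letter_freqs(text):
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    return {c: n / len(letters) for c, n in counts.items()}

def distance(freqs, ref):
    # Sum of absolute frequency differences over the reference letters.
    return sum(abs(freqs.get(c, 0.0) - p) for c, p in ref.items())

def guess(text):
    freqs = letter_freqs(text)
    return min(REF, key=lambda lang: distance(freqs, REF[lang]))
```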
Upvotes: 10