Vivek Kumar

Reputation: 5050

Word language detection in C++

After searching on Google, I haven't found any standard way or library for detecting which language a particular word belongs to.

Suppose I have a word: how could I find out which language it is in (English, Japanese, Italian, German, etc.)?

Is there any library available for C++? Any suggestion in this regard will be greatly appreciated!

Upvotes: 7

Views: 5858

Answers (8)

Coder12345

Reputation: 3753

Spell-check the first 3 words of your text in all languages (the more words you spell-check, the better). The language with the fewest spelling errors "wins". With only 3 words it is technically possible to have the same spelling in a few languages, but with each additional word that becomes less probable. It is not a perfect method, but I figure it would work in most cases.

Otherwise, if there is an equal number of errors in all languages, use the default language. Or randomly pick another 3 words until you have a clearer result. Or expand the number of spell-checked words beyond 3, which will also give you a clearer result.

As for spell-checking libraries, there are many; I personally prefer Hunspell. Nuspell is probably also good. Which one to use is a matter of personal preference and/or technical constraints.

Upvotes: 0

mrz

Reputation: 1872

I have found Google's CLD very helpful. It's written in C++, and from their web site:

"CLD (Compact Language Detector) is the library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings."

Upvotes: 3

Dov

Reputation: 8570

Simple language recognition from words is easy. You don't need to understand the semantics of the text, and you don't need any computationally expensive algorithms, just a fast hash map. The problem is that you need a lot of data. Fortunately, you can probably find dictionaries of words in each language you care about. Define a bit mask for each language; that will allow you to mark words like "the" as recognized in multiple languages. Then read each language's dictionary into your hash map. If a word is already present from a different language, just mark the current language as well.

Suppose a given word is in both English and French, with ENGLISH = 1, FRENCH = 2, ... Then looking it up (e.g. "commercial") will map to ENGLISH|FRENCH, and you'll find the value 3. If you want to know whether a word is in your language only, you would test:

int langs = dict["the"];
if ((langs | mylang) == mylang)
   // no other language



Since there will be other languages, a more general approach is probably better. For each bit set in the mask, add 1 to the corresponding language's counter. Do this for n words. After about n = 10 words in a typical text, you'll have 10 for the dominant language and maybe 2 for a related language (like English/French), so you can determine with high probability that the text is English. Remember, even a text that is in one language can still contain a quote in another, so the mere presence of a foreign word doesn't mean the document is in that language. Pick a threshold and it will work quite well (and very, very fast).

Obviously the hardest thing about this is reading in all the dictionaries. This isn't a code problem, it's a data collection problem. Fortunately, that's your problem, not mine.

To make this fast, you will need to preload the hash map, otherwise loading it at startup is going to hurt. If that's an issue, you will have to write store and load methods for the hash map that block-load the entire thing efficiently.

Upvotes: 3

Potatoswatter

Reputation: 137960

This will not work well one word at a time, as many words are shared between languages. For instance, in several languages "the" means "tea".

Language processing libraries tend to be more comprehensive than just this one feature, and as C++ is a "high-performance" language it might be hard to find one for free.

However, the problem might not be too hard to solve yourself. See the Wikipedia article on the problem for ideas. A small support vector machine might also do the trick quite handily. Just train it on the most common words of the relevant languages, and you might have a very effective "database" in just a kilobyte or so.

Upvotes: 2

bmargulies

Reputation: 100196

Well,

Statistically trained language detectors work surprisingly well on single-word inputs, though there are obviously some cases where they can't possibly work, as observed by others here.

In Java, I'd send you to Apache Tika. It has an open-source statistical language detector.

For C++, you could use JNI to call it. Now, time for a disclaimer. Since you specifically asked for C++, and since I'm unaware of a free C++ alternative, I will now point you at a product of my employer, which is a statistical language detector, native in C++.

http://www.basistech.com, the product name is RLI.

Upvotes: 2

TonyK

Reputation: 17124

Basically you need a huge database of all the major languages. To auto-detect the language of a piece of text, pick the language whose dictionary contains the most words from the text. This is not something you would want to have to implement on your laptop.

Upvotes: 1

DevSolar

Reputation: 70401

I wouldn't hold my breath. It is difficult enough to determine the language of a text automatically. If all you have is a single word, without context, you would need a database of all the words of all the languages in the world... the size of which would be prohibitive.

Upvotes: 1

Divyang Mithaiwala

Reputation: 171

I assume that you are working with text, not with speech.

If you are working with Unicode, then it assigns each script its own block of code points.

So you can check whether all the characters of a particular word fall into a given script's block.

For more help about Unicode script blocks, you can look over here

Upvotes: -4
