Reputation: 5117
I am testing the polyglot
package in Python to detect the languages in a mixed-language document.
I am not expecting the most accurate prediction from it, but for a start the package never returns more than one language as an answer, even for texts which contain 2 or 3 languages.
The texts I am using have on average 20 words, such as the following:
text = 'Je travaillais en France. Je suis tres heureux. I work in London. I grew up in Manchester.'
I always get something like the following - never a multiple-language answer:
Prediction is reliable: True
Language 1: name: English code: en confidence: 98.0 read bytes: 682
Language 2: name: un code: un confidence: 0.0 read bytes: 0
Language 3: name: un code: un confidence: 0.0 read bytes: 0
It is nowhere near the example in its docs:
> China (simplified Chinese: 中国; traditional Chinese: 中國),
>
> name: English code: en confidence: 71.0 read bytes: 887
> name: Chinese code: zh_Hant confidence: 11.0 read bytes: 1755
> name: un code: un confidence: 0.0 read bytes: 0
Although, to be honest, when I run the detector with the Chinese-English example above I do get a mixed-language answer.
The code is simply the following:
from polyglot.detect import Detector
text = 'Je travaillais en France. Je suis tres heureux. I work in London. I grew up in Manchester.'
answer = Detector(text)
print(answer)
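For reference, iterating over the candidates directly (the .languages attribute from the polyglot docs) gives the same single-language picture:
from polyglot.detect import Detector

text = 'Je travaillais en France. Je suis tres heureux. I work in London. I grew up in Manchester.'

# Print each of the three candidate predictions separately
for language in Detector(text).languages:
    print(language)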
Why is this happening?
P.S.
Also, when detecting the language of a single (even very common) word, polyglot
is pretty bad.
For example, for the word quantita
(which is Italian) it returns English.
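A minimal sketch of what I run for the single-word case (quiet=True only silences the warning that the prediction is unreliable for such short input):
from polyglot.detect import Detector

# Single Italian word; quiet=True silences the "not reliable" warning
# that polyglot emits for very short input
detector = Detector("quantita", quiet=True)
print(detector.language)  # comes back as English instead of Italian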
I know that many of these packages are mainly successful on large texts, but it is surprising that they cannot capture even such simple words.
Textblob
seems to be good even with single words, but you can send only a very limited number of requests to it (in both cases perhaps because it uses the Google API).
Upvotes: 1
Views: 1263
Reputation: 11
Just use this approach:
print(Detector('Your_text', quiet=True))
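A slightly fuller sketch of that approach (the text here is just a placeholder; quiet=True suppresses the "prediction not reliable" warning on short or mixed input):
from polyglot.detect import Detector

text = 'Je travaillais en France. I work in London.'  # placeholder example

detector = Detector(text, quiet=True)
print(detector)  # summary: reliability flag plus the three candidate languages

# Each candidate exposes name, code, confidence and read_bytes
for language in detector.languages:
    print(language.name, language.code, language.confidence)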
Also, don't forget to download the language packages. To download all language packages for the transliteration mode you can use:
from polyglot.downloader import downloader
downloader.download("TASK:transliteration2", quiet=True)
To download all modes for a specific language, simply run this command from the terminal:
polyglot download LANG:ar
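If you prefer to stay inside Python, the same collection id should work with the downloader module too (I am assuming the LANG: collections follow the same naming as the TASK: collection used above):
from polyglot.downloader import downloader

# Assumed Python equivalent of `polyglot download LANG:ar`:
# LANG:ar is the collection with all Arabic packages
downloader.download("LANG:ar", quiet=True)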
I suggest reading the complete manual on downloading modes and language packs here: https://polyglot.readthedocs.io/en/latest/Download.html
Upvotes: 1
Reputation: 11
I think Polyglot detects the language by reading the characters used in the text. The examples you have mentioned above are all written in the Latin script. It doesn't matter whether the words are French, Italian, Spanish, or transliterated Chinese; they will all be detected as English because they are written using the character set of the English language.
So Polyglot is only useful for languages which use non-Latin characters, like Greek, Russian, Arabic or Chinese.
That is why you got Chinese as well in the case below; the confidence is low because very few characters are Chinese and most are Latin:
China (simplified Chinese: 中国; traditional Chinese: 中國),
name: English code: en confidence: 71.0 read bytes: 887
name: Chinese code: zh_Hant confidence: 11.0 read bytes: 1755
name: un code: un confidence: 0.0 read bytes: 0
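A small sketch that illustrates the difference, using the two texts from the question (quiet=True only suppresses the reliability warning):
from polyglot.detect import Detector

latin_only = ('Je travaillais en France. Je suis tres heureux. '
              'I work in London. I grew up in Manchester.')
mixed_script = 'China (simplified Chinese: 中国; traditional Chinese: 中國),'

# All-Latin text: reported as a single dominant language (English)
for language in Detector(latin_only, quiet=True).languages:
    print(language)

# Text that also contains Chinese characters: a second language shows up,
# with low confidence because only a few bytes are non-Latin
for language in Detector(mixed_script, quiet=True).languages:
    print(language)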
Upvotes: 0