doniyor

Reputation: 37904

guess_language module is giving UNKNOWN

I installed the following (I am on Windows 7, using a virtualenv with Python 2.7.5):

pip install pyenchant
pip install 3to2
pip install https://bitbucket.org/spirit/guess_language/downloads/guess_language-spirit-0.5.tar.bz2

and did:

>>> from guess_language import guess_language
>>> guess_language("Hello World")
u'UNKNOWN'

Why am I getting u'UNKNOWN'?

This is the project site.

Upvotes: 1

Views: 525

Answers (1)

Shiplu Mokaddim

Reputation: 57670

I suggest using nltk for this. It is much easier with nltk.

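import nltk

# Requires the stopwords corpus: run nltk.download('stopwords') once.
# Map each available language to its set of stopwords.
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang))
                  for lang in nltk.corpus.stopwords.fileids()}

def get_language(text):
    # Score each language by how many of its stopwords occur in the text.
    words = set(nltk.wordpunct_tokenize(text.lower()))
    return max(((lang, len(words & stopwords))
                for lang, stopwords in STOPWORDS_DICT.items()),
               key=lambda x: x[1])[0]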

Now see the code in action.

In [28]: get_language('hello world')
Out[28]: 'swedish'

In [30]: get_language('stackoverflow is a nice website')
Out[30]: 'english'

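The problem is that if the sample text is very short, it will give the wrong result, as with 'hello world' above: when few or no stopwords match, the scores tie near zero and max() returns whichever language happens to come first.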
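One way to guard against this short-text failure is to refuse to guess when too few stopwords match. The following is only a minimal sketch, not part of the original code: the function name get_language_or_none, the min_hits parameter, and the tiny hand-written TOY_STOPWORDS sets are all hypothetical and stand in for the nltk-based STOPWORDS_DICT so the example runs without downloading any corpus.

```python
# Hypothetical, hand-written stopword sets used only for illustration;
# the answer's code builds the real ones from nltk.corpus.stopwords.
TOY_STOPWORDS = {
    "english": {"the", "is", "a", "and", "of", "to", "in"},
    "swedish": {"och", "det", "att", "en", "som", "inte", "jag"},
    "german": {"der", "die", "und", "ist", "das", "nicht", "ein"},
}

def get_language_or_none(text, stopwords_by_lang=TOY_STOPWORDS, min_hits=1):
    """Return the best-matching language, or None if too few stopwords match."""
    # Simple whitespace tokenization (an assumption; the answer uses
    # nltk.wordpunct_tokenize).
    words = set(text.lower().split())
    # Count how many stopwords of each language occur in the text.
    scores = {lang: len(words & stops)
              for lang, stops in stopwords_by_lang.items()}
    best_lang = max(scores, key=scores.get)
    # Refuse to guess when the evidence is too thin.
    if scores[best_lang] < min_hits:
        return None
    return best_lang

print(get_language_or_none("hello world"))
print(get_language_or_none("stackoverflow is a nice website"))
```

With these toy sets, 'hello world' matches no stopwords, so the function returns None instead of an arbitrary language. Raising min_hits makes the function more conservative at the cost of rejecting more short inputs.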

The code is taken from this site.

Upvotes: 2
