NLP NaiveBayesClassifier for utf-8 in Python NLTK

Question

I'm trying to use NLTK to perform some NLP NLTK classification for Arabic phrases. If I enter the native words as is in the classifier then it complains about non-ascii characters. Currently, I'm doing word.decode('utf-8') and then entering that as input to the trainer.

When I test the classifier, the results make some sense if there was an exact match. However, if I test a substring of a word in the original training words then results looks somewhat random.

I just want to distinguish if this was a bad classifier or if there's something fundamental in the encoding that degrades the performance of the classifier. Is this a reasonable way to input non-ascii text to classifiers?

#!/usr/bin/python
# -*- coding: utf-8 -*-

from textblob.classifiers import NaiveBayesClassifier

x = "الكتاب".decode('utf-8')
...

train = [
(x,'pos'),
]

cl = NaiveBayesClassifier(train)

t = "كتاب".decode('utf-8')
cl.classify(t)

The word in t is simply x with the first two letters removed. I'm running this with a much bigger dataset of course.

NLP NaiveBayesClassifier for utf-8 in Python NLTK

Answers (1)

Related Questions