juliano.net

Reputation: 8177

Improve accuracy for text categorization (currently getting 62% for Naive Bayes and SVM)

I have a dataset like this:

  | COD | COMPDESC                          | CDESCR
0 |  10 | STRUCTURE:BODY:DOOR               | AUTOMATIC DOOR LOCKS WHEN USED, WILL NOT RELEA...
1 |  18 | VEHICLE SPEED CONTROL             | VEHICLE SUDDENLY ACCELERATED OUT OF CONTROL, B...
2 |  24 | STEERING:WHEEL AND HANDLE BAR     | STEERING WHEEL BOLTS LOOSENEDAND ROCKED BACK A...
3 |  40 | SUSPENSION:FRONT:MACPHERSON STRUT | MISALIGNMENT, CAUSING VEHICLE TO PULL TO THE R...
4 |  55 | STEERING:WHEEL AND HANDLE BAR     | DUE TO DEFECT STEERING BOLTS, STEERING WHEEL I...

I tried to use Naive Bayes and SVM for the prediction after using NLTK for stemming and applying CountVectorizer, but the accuracy is much lower than in this article, which uses a dataset with just 20,000 rows (mine has 1 million rows, but I can only use 100,000 at a time because of memory limits).

I tried ngram_range=(1,1) and ngram_range=(1,2) and the results were almost the same. The latter required more memory, so I had to decrease the number of rows being processed.

What can I do to improve this accuracy? Improving the data cleaning may be one way, but what else can I use, considering that I'm already stemming and removing stop words (including numbers)?

import csv
import random
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# The row indices to skip - make sure 0 is not included to keep the header!
# (num_lines and size are computed earlier and not shown here.)
skip_idx = random.sample(range(1, num_lines), num_lines - size)

dataset = pd.read_csv('SIMPLE_CMPL.txt', skiprows=skip_idx,
                      delimiter=',', quoting=csv.QUOTE_ALL,  # quoting takes a csv.QUOTE_* constant
                      header=0, encoding="ISO-8859-1", skip_blank_lines=True)

train_data, test_data = train_test_split(dataset, test_size=0.3)

from sklearn.feature_extraction import text
import string

# Extend the built-in English stop words with number words, digits and punctuation
my_stop_words = text.ENGLISH_STOP_WORDS.union(
    ['tt', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
).union(string.punctuation)

# Stemming Code
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    # Run the standard analyzer, then stem each resulting token
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

stemmed_count_vect = StemmedCountVectorizer(stop_words=my_stop_words, ngram_range=(1,2))

text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer(use_idf=False)), 
                             ('mnb', MultinomialNB(fit_prior=False, alpha=0.01))])

text_mnb_stemmed = text_mnb_stemmed.fit(train_data['CDESCR'], train_data['COMPID'])

predicted_mnb_stemmed = text_mnb_stemmed.predict(test_data['CDESCR'])

np.mean(predicted_mnb_stemmed == test_data['COMPID'])


# 0.6255




# Stemming Code (same setup as above, reused for the SVM pipeline)
from sklearn.linear_model import SGDClassifier
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    # Run the standard analyzer, then stem each resulting token
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

stemmed_count_vect = StemmedCountVectorizer(stop_words=my_stop_words, ngram_range=(1,1))

text_svm_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer(use_idf=True)),
                             ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=0.001,
                                                       # n_iter was renamed to max_iter in newer scikit-learn and must be an int
                                                       max_iter=int(np.ceil(10**6 / train_data['COD'].count())),
                                                       random_state=60))])

text_svm_stemmed = text_svm_stemmed.fit(train_data['CDESCR'], train_data['COMPID'])

predicted_svm_stemmed = text_svm_stemmed.predict(test_data['CDESCR'])

np.mean(predicted_svm_stemmed == test_data['COMPID'])


# 0.6299

Upvotes: 0

Views: 889

Answers (1)

dfernig

Reputation: 626

the accuracy is much lower than in this article, which uses a dataset with just 20,000 rows

You can't compare scores across two different datasets. The accuracy of a machine learning classifier is not simply a function of the size of the dataset - it is also a function of the strength of the signal in the data.

Suppose you had a dataset where every document contained a feature that was unique to its label. Then you could get perfect accuracy with only a small amount of training data. By contrast, if the documents were generated randomly and the labels were assigned at random, your classifier would do no better than chance (50% for two balanced classes), no matter how much data you had.
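
As a concrete (made-up) illustration of the second case, the minimal sketch below trains the same kind of CountVectorizer + MultinomialNB pipeline as in your question on random documents with randomly assigned labels; accuracy hovers around 0.5 no matter how many rows you generate. The vocabulary, sizes and seed are invented for the example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
vocab = ['door', 'lock', 'steering', 'wheel', 'bolt', 'strut', 'speed', 'control']

# Random "documents" of 10 words each, with labels assigned independently of the text
docs = [' '.join(rng.choice(vocab, size=10)) for _ in range(20000)]
labels = rng.integers(0, 2, size=20000)

X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.3, random_state=0)

clf = Pipeline([('vect', CountVectorizer()), ('mnb', MultinomialNB())])
clf.fit(X_train, y_train)
print(np.mean(clf.predict(X_test) == y_test))  # ~0.5: no signal, so more data doesn't help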

What can I do to improve this accuracy? Improving the data cleaning may be one way, but what else can I use, considering that I'm already stemming and removing stop words (including numbers)?

A priori it may seem like stemming and removing numbers will help your score - but you don't actually know this; sometimes features like these help a model. Rather than deciding up front what values to use for ngram_range, stop_words, and (most importantly) alpha, these values should be determined via cross-validation. The article you've linked shows how to do this, and the example in the sklearn GridSearchCV documentation is also worth a look.
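
For example, a grid search over the pipeline you already have could look like this sketch (the grid values are illustrative, not recommendations, and it assumes the text_mnb_stemmed pipeline and train_data from your question are in scope):

from sklearn.model_selection import GridSearchCV

# Parameter names follow the pipeline step names: 'vect', 'tfidf', 'mnb'
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'mnb__alpha': (1.0, 0.1, 0.01),
}

gs_clf = GridSearchCV(text_mnb_stemmed, parameters, cv=5, n_jobs=-1)
gs_clf = gs_clf.fit(train_data['CDESCR'], train_data['COMPID'])

print(gs_clf.best_score_)
print(gs_clf.best_params_)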

Additionally, it looks like your dataset occupies a very specific domain. Sometimes handcrafted rules for cleaning and feature extraction can help on domain-specific tasks. I would spend some time examining the data to see whether there are more specific cleaning rules you could apply and meta-features you could extract.
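
As a purely hypothetical illustration (the regexes and the normalization token are invented, not derived from your data), a domain-specific preprocessor can be plugged straight into the vectorizer from your question:

import re

def clean_complaint(text):
    # Invented cleaning rules for complaint-style text; tune against the real data
    text = text.lower()
    text = re.sub(r'\b\d+(\.\d+)?\s*(mph|km/h|miles?)\b', ' speedvalue ', text)  # collapse speeds/distances to one token
    text = re.sub(r'[^a-z\s]', ' ', text)                                        # drop remaining digits and punctuation
    return re.sub(r'\s+', ' ', text).strip()

# Reuses the StemmedCountVectorizer and my_stop_words defined in the question
vectorizer = StemmedCountVectorizer(preprocessor=clean_complaint,
                                    stop_words=my_stop_words,
                                    ngram_range=(1, 2))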

Upvotes: 1
