Reputation: 1661
I ran Multinomial and Bernoulli Naive Bayes and Linear SVC on a set of tweets I have. They do well on a 60/40 split of 1000 training tweets (80%, 80%, and 90%, respectively).
Each algorithm has parameters that can be changed, and I am wondering whether I can get better results by altering them. I don't know much about machine learning beyond training, testing, and predicting, so I would appreciate advice on which parameters to tweak.
Here is the code I used:
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn import svm
trainfile = 'training_words.txt'
testfile = 'testing_words.txt'
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
tags = training_labels
mnb = svm.LinearSVC() #Or any other classifier
mnb.fit(trainset, tags)
# transform the test tweets with the vocabulary learned from the training set
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8'))
results = mnb.predict(testset)
print(results)
Upvotes: 1
Views: 635
Reputation: 24752
You can use Grid Search Cross Validation to tune your model parameters with a stratified K-Fold cross-validation split. Here is some example code.
import codecs
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn import svm
from sklearn.model_selection import GridSearchCV  # was sklearn.grid_search before scikit-learn 0.18
trainfile = 'training_words.txt'
testfile = 'testing_words.txt'
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
tags = training_labels
mnb = svm.LinearSVC() # or any other classifier
# check the sklearn online docs to see which parameters your chosen estimator exposes;
# for a linear SVM, C and class_weight are the important ones to tune
# ('balanced' was called 'auto' in older scikit-learn releases)
params_space = {'C': np.logspace(-5, 0, 10), 'class_weight':[None, 'balanced']}
# build a grid search cv, n_jobs=-1 to use all your processor cores
gscv = GridSearchCV(mnb, params_space, cv=10, n_jobs=-1)
# fit the model
gscv.fit(trainset, tags)
# take a look at the best parameter combination and the best cross-validated score
print(gscv.best_estimator_)
print(gscv.best_params_)
print(gscv.best_score_)
# predict on the test set with the best estimator, refit on the full training set
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8'))
results = gscv.predict(testset)
print(results)
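The same approach should carry over to the Naive Bayes classifiers from the question. Below is a minimal sketch, not a definitive recipe: it wraps the vectorizer and MultinomialNB in a Pipeline so that vectorizer settings can be searched together with the classifier's smoothing parameter alpha. The toy texts and labels, the step names "vect" and "clf", and the candidate values in the grid are all assumptions made for illustration; swap in your own tweets, labels, and ranges.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the tweets and labels (hypothetical data, not from the question).
texts = ["good day", "great fun", "love this", "so happy", "nice work", "really good",
         "bad day", "awful time", "hate this", "so sad", "poor work", "really bad"]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Chain the vectorizer and the classifier so every CV fold refits the vocabulary
# on its own training portion only.
pipeline = Pipeline([
    ("vect", CountVectorizer(analyzer="word")),
    ("clf", MultinomialNB()),
])

# The step-name prefixes ("vect__", "clf__") route each parameter to its step.
# The candidate values are illustrative guesses, not recommendations.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.01, 0.1, 1.0],
}

# cv=3 only because the toy data set is tiny; the answer above uses cv=10.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)

# The fitted search object accepts raw text because the pipeline vectorizes it.
predictions = search.predict(["really happy day"])
print(predictions)
```

Putting the vectorizer inside the pipeline also means the test tweets can be passed to predict as raw text, without a separate transform call.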
Upvotes: 2