blue

Reputation: 7375

How to compute k-fold cross validation and standard dev of performance for each classifier?

I need to (per a prompt) "compute the n-fold cross validation as well as mean and standard deviation of the performance measure on the n folds" for each of 3 algorithms.

My original dataframe is structured like this, where there are 16 types that repeat:

target   type    post
1      intj    "hello world shdjd"
2      entp    "hello world fddf"
16     estj   "hello world dsd"
4      esfp    "hello world sfs"
1      intj    "hello world ddfd"

I've trained and computed accuracy for Naive Bayes, SVM, and Logistic Regression like this:

text_clf3 = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(multi_class = 'multinomial', solver = 'newton-cg')),
])

text_clf3.fit(result.post, result.target)

predicted3 = text_clf3.predict(docs_test)
print("Logistics Regression: ")
print(np.mean(predicted3 == result.target))

With clf being

LogisticRegression(multi_class = 'multinomial', solver = 'newton-cg')

SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)

and

MultinomialNB(alpha = 0.0001)

I can get metrics.classification_report(result.target, predicted3) for each model, but I don't know how to implement cross-validation.

How can I do this?

Upvotes: 0

Views: 1574

Answers (1)

Shihab Shahriar Khan

Reputation: 5455

I can't test this because I don't have the dataset, but the code below should make the main idea clear. Here, all_post denotes all samples combined (both result.post and docs_test in your example), all_target denotes the matching labels, and n is assumed to be 10.

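from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

models = {'lr': LogisticRegression(multi_class='multinomial', solver='newton-cg'),
          'nb': MultinomialNB(alpha=0.0001),
          'sgd': SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42,
                               max_iter=5, tol=None)}

for name, clf in models.items():
    # Rebuild the full pipeline so the vectorizer is refit on each training fold
    pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', clf)])
    # cv=10 gives 10 folds; res is an array of 10 accuracy scores
    res = cross_val_score(pipe, all_post, all_target, cv=10)
    print(name, res.mean(), res.std())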
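If you want something that runs without your data, here is a minimal sketch on a made-up toy corpus. The toy posts, the three labels, the helper name cv_summary, and the choice of 5 shuffled stratified folds are all my assumptions for illustration, not part of your setup. I also left out multi_class for the logistic regression, because newer scikit-learn releases deprecate that argument.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 3 classes with 10 short posts each
words = {1: "plan logic theory", 2: "debate idea joke", 3: "party music dance"}
all_post = [f"{text} post{i}" for label, text in words.items() for i in range(10)]
all_target = np.array([label for label in words for _ in range(10)])


def cv_summary(models, X, y, cv):
    """Return {name: (mean, std)} of the per-fold accuracy for each classifier."""
    summary = {}
    for name, clf in models.items():
        # The whole pipeline is cross-validated, so the text features
        # are learned from the training part of each fold only
        pipe = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', clf)])
        scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
        summary[name] = (scores.mean(), scores.std())
    return summary


models = {'lr': LogisticRegression(max_iter=1000),
          'nb': MultinomialNB(alpha=0.0001),
          'sgd': SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                               random_state=42, max_iter=5, tol=None)}

# Assumed fold setup: 5 stratified folds, shuffled with a fixed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = cv_summary(models, all_post, all_target, cv)
for name, (mean, std) in results.items():
    print(f"{name}: mean={mean:.3f} std={std:.3f}")
```

Passing an explicit StratifiedKFold instead of an integer lets you shuffle the rows first, which matters if your dataframe is sorted by type. To use it on your data, swap the toy lists for your own posts and targets and set n_splits to your n.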

Upvotes: 0
