Text Classification + Naive Bayes + Scikit learn

Question

This is my first time doing text classification with Naive Bayes. I found this code at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html:

>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

I want to check my understanding of the two arguments, X_train_tfidf and twenty_train.target, passed to fit().

X_train_tfidf is the tf-idf vector representation of all the documents in the training set.

twenty_train.target holds the corresponding labels of those documents, in exactly the order in which the documents appear in X_train_tfidf.

Am I correct?

Fabio Picchi · Accepted Answer

Short answer: Yes

Long answer: This holds for the fit method of every supervised estimator in the scikit-learn API. Given a document matrix X of shape [n_samples, n_features], the target vector y has shape [n_samples], and row X[i, :] matches target y[i] for every i from 0 to n_samples - 1.
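
To make the alignment concrete, here is a minimal sketch. The four-document corpus and the two class labels are made up for illustration and are not part of the tutorial's 20 newsgroups data. It shows that scikit-learn stores one document per row, so the number of rows in the tf-idf matrix must equal the number of labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: docs[i] is labelled by labels[i].
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the new graphics card renders faster",
    "install the driver for the graphics card",
]
labels = [0, 0, 1, 1]  # 0 = sport, 1 = hardware (made-up classes)

vectorizer = TfidfVectorizer()
# Sparse matrix with one row per document and one column per vocabulary term.
X_train_tfidf = vectorizer.fit_transform(docs)

n_samples, n_features = X_train_tfidf.shape
assert n_samples == len(labels)  # rows and labels line up one-to-one

clf = MultinomialNB().fit(X_train_tfidf, labels)

# New text must go through the same fitted vectorizer before predict().
new_doc = vectorizer.transform(["a goal in the football match"])
predicted = clf.predict(new_doc)
print(X_train_tfidf.shape, predicted)
```

In the tutorial, twenty_train.data and twenty_train.target play the roles of docs and labels here.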

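If the number of documents and the number of targets differ, fit raises an error. If the counts match but the documents and targets are out of order, training still runs without complaint, and you will probably get a very poor and unreasonable result.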
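
The two failure modes can be sketched as follows. The small count matrix and label arrays are invented for illustration. A length mismatch is rejected with a ValueError, whereas a same-length but wrongly ordered label vector is accepted silently, so scikit-learn cannot catch that mistake for you.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term counts: 3 samples (rows), 3 features (columns).
X = np.array([[1, 0, 2],
              [0, 3, 1],
              [2, 1, 0]])

# Case 1: fewer labels than rows. fit() refuses to train.
y_short = np.array([0, 1])
try:
    MultinomialNB().fit(X, y_short)
    raised = False
except ValueError as err:
    raised = True
    print("fit rejected mismatched lengths:", err)

# Case 2: correct length but wrong order. fit() cannot detect this,
# so the model is trained on wrong document/label pairs.
y = np.array([0, 1, 1])
y_wrong_order = y[::-1]  # [1, 1, 0]
clf_misordered = MultinomialNB().fit(X, y_wrong_order)
```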
