Reputation: 969
I am doing text classification with Naive Bayes for the first time. I found this code at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html :
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
I want to clear up one doubt about the parameters X_train_tfidf and twenty_train.target passed to the function fit().
X_train_tfidf is the TF-IDF vector representation of all the documents in the training set.
twenty_train.target holds the corresponding labels of the documents, in the exact order in which the documents appear in X_train_tfidf.
Am I correct?
Upvotes: 1
Views: 130
Reputation: 1202
Short answer: Yes
Long answer: This is true for every supervised fit method you will find in the scikit-learn API. Given a matrix of documents X with dimensions [m, n] (m documents as rows, n features as columns), the target vector Y has length m, and document X[i, :] matches target Y[i] for every i from 0 to m-1.
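If documents and targets don't match, fit() will still run as long as the lengths agree, but the classifier will learn from the wrong labels and you will probably get very poor, unreasonable results from your training process.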
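To make the row-to-label alignment concrete, here is a minimal sketch. The toy corpus, the labels, and the variable names docs, labels and new_docs are made up for illustration and are not the 20 newsgroups data from the tutorial; only TfidfVectorizer and MultinomialNB come from scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for twenty_train.data.
docs = [
    "the gpu renders the image quickly",
    "graphics card and image rendering",
    "the patient was given a new medicine",
    "the doctor prescribed medicine to the patient",
]
# labels[i] is the class of docs[i]: 0 = graphics, 1 = medicine.
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
# Sparse matrix with one row per document and one column per vocabulary term.
X_train_tfidf = vectorizer.fit_transform(docs)

n_docs, n_features = X_train_tfidf.shape
# fit() needs exactly one label per row of X, in the same order.
assert n_docs == len(labels)

clf = MultinomialNB().fit(X_train_tfidf, labels)

# New documents must go through the same fitted vectorizer before predict().
new_docs = ["image rendering on the gpu", "medicine for the patient"]
predicted = clf.predict(vectorizer.transform(new_docs))
print(X_train_tfidf.shape, predicted)
```

Shuffling labels without shuffling docs in the same way would still run without an error, which is why the ordering has to be checked by hand.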
Upvotes: 1