Reputation: 11
I am trying to train a classifier that takes a news headline as input and outputs the tags that fit that headline. My data contains a set of news headlines as the input variables and the meta-tags for those headlines as the output variables.
I one-hot encoded both the headlines and their corresponding meta-tags into two separate CSVs. I then combined them into one large data frame, with the X_train values being a 5573x958 numpy array for the headline words and the y_train values being a 5573x843 numpy array for the tags.
Here is an image of the pandas data frame containing my data in one-hot encoded form.
The goal of my classifier is for me to feed in a headline and have the most related tags to that headline as the output. The problem I have is the following.
X_train = train_set.iloc[:, :958].values
X_train.shape
(out) (5573, 958)
y_train = train_set.iloc[:, 958:].values
y_train.shape
(out) (5573, 843)
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB().fit(X_train, y_train)
When I train it using a naive-bayes classifier, I get the following error message:
bad input shape (5573, 843)
From what I have researched, the only way to have multi-label target values is to one-hot encode them. When I tried LabelEncoder() or MultiLabelBinarizer(), I had to specify the name of each column to be binarized, and with over 800 columns (words) I could not figure out how to do it. So I just one-hot encoded them, which I believe gives the same result, but the classifier doesn't accept it as input. Any suggestions on how I can fix this?
Upvotes: 0
Views: 183
Reputation: 132
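You can use scikit-learn's multi-output classification. MultinomialNB on its own expects a 1-D target, which is why a y_train of shape (5573, 843) is rejected; MultiOutputClassifier wraps it and fits one classifier per target column. Here is an example: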
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultiOutputClassifier(MultinomialNB()).fit(X_train, y_train)
See the documentation for sklearn.multioutput.MultiOutputClassifier for details.
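To get the "most related tags" for a headline rather than only a 0/1 prediction per tag, one option is to rank the tags by their predicted probability. The following is a minimal sketch on small synthetic data, not your actual data set: the names X_demo, Y_demo, scores and top_tags are made up for illustration, and the array sizes are arbitrary.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the one-hot encoded data (assumption: binary 0/1 features).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 30))

# Three hypothetical tag columns derived from the word columns,
# so that every tag has both positive and negative examples.
Y_demo = np.column_stack([
    X_demo[:, 0],
    X_demo[:, 1] | X_demo[:, 2],
    1 - X_demo[:, 3],
])

# One MultinomialNB is fitted per tag column.
clf = MultiOutputClassifier(MultinomialNB()).fit(X_demo, Y_demo)

# Hard 0/1 predictions: one row per headline, one column per tag.
pred = clf.predict(X_demo[:5])

# predict_proba returns a list with one array per tag; column 1 of each
# array is the probability of the tag being present (class 1).
# Note: a tag column that is all 0 (or all 1) in training would yield a
# single-column array here and would need separate handling.
proba = clf.predict_proba(X_demo[:5])
scores = np.column_stack([p[:, 1] for p in proba])

# Indices of the two highest-scoring tags for each headline.
top_tags = np.argsort(-scores, axis=1)[:, :2]
```

With your data you would pass X_train and y_train instead, and map the indices in top_tags back to the tag column names of your data frame.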
Upvotes: 1