Snow

Reputation: 1138

How to add more features in multi-class text classification?

I have a retail dataset with product_description, price, supplier, and category as columns. So far I have used only product_description as a feature:

from sklearn import model_selection, preprocessing, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer

# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['product_description'], df['category'])

# label encode the target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['product_description'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)

classifier = naive_bayes.MultinomialNB().fit(xtrain_tfidf, train_y)

# predict the labels on validation dataset
predictions = classifier.predict(xvalid_tfidf)
metrics.accuracy_score(predictions, valid_y) # ~20%, very low

Since the accuracy is very low, I want to add the supplier and price as features too. How can I incorporate this in the code?

I have tried other classifiers like LR, SVM, and Random Forest, but they gave (almost) the same result.

Upvotes: 5

Views: 614

Answers (1)

Jindřich

Reputation: 11240

The TF-IDF vectorizer returns a matrix with one row per example containing the scores. You can modify this matrix however you like before feeding it into the classifier.

  • Prepare your additional features as a NumPy array of shape: number of examples × number of features.

  • Use np.concatenate with axis=1.

  • Fit the classifier as you did before.

It is usually a good idea to normalize real-valued features. Also, you can try different classifiers: Logistic Regression or SVM might do a better job for real-valued features than Naive Bayes.
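Here is a minimal sketch of these steps, assuming the DataFrame is the `df` from the question with the column names given there. It uses `scipy.sparse.hstack` rather than `np.concatenate` so the TF-IDF matrix can stay sparse, one-hot encodes the supplier, and scales the price with `MinMaxScaler` (which keeps values non-negative, as MultinomialNB requires):

    import numpy as np
    from scipy import sparse
    from sklearn import model_selection, preprocessing, naive_bayes, metrics
    from sklearn.feature_extraction.text import TfidfVectorizer

    # df is the retail DataFrame from the question (assumed already loaded)
    # split the whole frame so text, supplier and price stay aligned
    train_df, valid_df = model_selection.train_test_split(df, random_state=42)

    # label-encode the target, reusing the fitted mapping for validation
    encoder = preprocessing.LabelEncoder()
    train_y = encoder.fit_transform(train_df['category'])
    valid_y = encoder.transform(valid_df['category'])

    # text features: TF-IDF as before
    tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
    tfidf_vect.fit(df['product_description'])
    xtrain_text = tfidf_vect.transform(train_df['product_description'])
    xvalid_text = tfidf_vect.transform(valid_df['product_description'])

    # supplier is categorical: one-hot encode it
    supplier_enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
    xtrain_sup = supplier_enc.fit_transform(train_df[['supplier']])
    xvalid_sup = supplier_enc.transform(valid_df[['supplier']])

    # price is real-valued: scale to [0, 1] so it is comparable to TF-IDF scores
    price_scaler = preprocessing.MinMaxScaler()
    xtrain_price = price_scaler.fit_transform(train_df[['price']])
    xvalid_price = price_scaler.transform(valid_df[['price']])

    # stack all feature blocks column-wise into one matrix per split
    xtrain = sparse.hstack([xtrain_text, xtrain_sup, sparse.csr_matrix(xtrain_price)])
    xvalid = sparse.hstack([xvalid_text, xvalid_sup, sparse.csr_matrix(xvalid_price)])

    classifier = naive_bayes.MultinomialNB().fit(xtrain, train_y)
    predictions = classifier.predict(xvalid)
    print(metrics.accuracy_score(valid_y, predictions))

Swapping MultinomialNB for LogisticRegression or a linear SVM at the last step needs no other changes, since both accept the same stacked feature matrix.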

Upvotes: 1
