Reputation: 1138
I have a retail dataset with product_description, price, supplier, category
as columns.
I used product_description
as feature:
from sklearn import model_selection, preprocessing, naive_bayes
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['product_description'], df['category'])
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['product_description'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
classifier = naive_bayes.MultinomialNB().fit(xtrain_tfidf, train_y)
# predict the labels on validation dataset
predictions = classifier.predict(xvalid_tfidf)
metrics.accuracy_score(predictions, valid_y) # ~20%, very low
Since the accuracy is very low, I want to add the supplier and price as features too. How can I incorporate this in the code?
I have tried other classifiers like LR, SVM, and Random Forrest, but they had (almost) the same outcome.
Upvotes: 5
Views: 614
Reputation: 11240
The TF-IDF vectorizer returns a matrix: one row per example with the scores. You can modify this matrix as you wish before feeding it into the classifier.
Prepare your additional features as a NumPy array of shape: number of examples × number of features.
Use np.concatenate
with axis=1
.
Fit the classifier as you did before.
It is usually a good idea to normalize real-valued features. Also, you can try different classifiers: Logistic Regression or SVM might do a better job for real-valued features than Naive Bayes.
Upvotes: 1