Reputation: 148
I'm new to machine learning. I'm trying to do some text classification. 'CleanDesc' has the text sentence. And 'output' has the corresponding output. Initially i tried using one input parameter which is the string of texts(newMerged.cleanDesc) and one output parameter(newMerged.output)
finaldata = newMerged[['id','CleanDesc','type','output']]
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newMerged.CleanDesc)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, newMerged.output)
testdata = newMerged.ix[1:200]
X_test_counts = count_vect.transform(testdata.CleanDesc)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_new_tfidf)
This works fine. But the accuracy is very low. I wanted to include one more parameter(newMerged.type) as the input, along with the text to try improving it. Can I do that? How do I do it. newMerged.type is not a text. It just a two character string like "HT". I tried doing it as follows, but it failed,
finaldata = newMerged[['id','CleanDesc','type','output']]
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newMerged.CleanDesc)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit([[X_train_tfidf,newMerged.type]],
newMerged.output)
testdata = newMerged.ix[1:200]
X_test_counts = count_vect.transform(testdata.CleanDesc)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict([[X_new_tfidf, testdata.type]])
Upvotes: 2
Views: 1769
Reputation: 16966
You have to use hstack from sicpy for appending arrays to sparse matrix.
Try this!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import hstack
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.shape)
#
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
You need to do encoding of your categorical variables.
cat_varia= ['s','ut','ss','ss']
lb=LabelBinarizer()
feature2=lb.fit_transform(cat_varia)
appended_X = hstack((X, feature2))
import pandas as pd
pd.DataFrame(appended_X.toarray())
#
0 1 2 3 4 5 6 7 8 9 10 11
0 0.000000 0.469791 0.580286 0.384085 0.000000 0.000000 0.384085 0.000000 0.384085 1.0 0.0 0.0
1 0.000000 0.687624 0.000000 0.281089 0.000000 0.538648 0.281089 0.000000 0.281089 0.0 0.0 1.0
2 0.511849 0.000000 0.000000 0.267104 0.511849 0.000000 0.267104 0.511849 0.267104 0.0 1.0 0.0
3 0.000000 0.469791 0.580286 0.384085 0.000000 0.000000 0.384085 0.000000 0.384085 0.0 1.0 0.0
Upvotes: 2