Multiple input parameters during text classification - Scikit learn

Question

I'm new to machine learning and am trying to do some text classification. 'CleanDesc' holds the text of each sentence, and 'output' holds the corresponding label. Initially I tried using one input feature, the text string (newMerged.CleanDesc), and one output (newMerged.output):

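from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

finaldata = newMerged[['id','CleanDesc','type','output']]

# turn the text into token counts, then into TF-IDF weights
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newMerged.CleanDesc)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, newMerged.output)
# take rows 1-199 by position as test data (.ix is not available in recent pandas)
testdata = newMerged.iloc[1:200]
X_test_counts = count_vect.transform(testdata.CleanDesc)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# predict on the test matrix built above
predicted = clf.predict(X_test_tfidf)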

This works, but the accuracy is very low. I want to include one more feature (newMerged.type) as input alongside the text to try to improve it. Can I do that, and if so, how? newMerged.type is not free text; it is just a two-character string like "HT". I tried the following, but it failed:

finaldata = newMerged[['id','CleanDesc','type','output']]

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newMerged.CleanDesc)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit([[X_train_tfidf,newMerged.type]], 
newMerged.output)    
testdata = newMerged.ix[1:200]
X_test_counts = count_vect.transform(testdata.CleanDesc)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

predicted = clf.predict([[X_new_tfidf, testdata.type]])

Venkatachalam · Accepted Answer

You have to use hstack from scipy.sparse to append arrays to a sparse matrix.

Try this!

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import hstack
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# on scikit-learn versions before 1.0, use vectorizer.get_feature_names() instead
print(vectorizer.get_feature_names_out().tolist())

print(X.shape)

Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)

You need to encode your categorical variable as numeric columns before you can append it to the TF-IDF matrix.

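# one categorical value per document in the corpus
cat_varia = ['s', 'ut', 'ss', 'ss']
lb = LabelBinarizer()
# one 0/1 column per category, in sorted order: 's', 'ss', 'ut'
feature2 = lb.fit_transform(cat_varia)

# append the encoded columns to the right of the sparse TF-IDF matrix
appended_X = hstack((X, feature2))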

import pandas as pd
pd.DataFrame(appended_X.toarray())

Output:

          0         1         2         3         4         5         6         7         8    9   10   11
0  0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085  0.000000  0.384085  1.0  0.0  0.0
1  0.000000  0.687624  0.000000  0.281089  0.000000  0.538648  0.281089  0.000000  0.281089  0.0  0.0  1.0
2  0.511849  0.000000  0.000000  0.267104  0.511849  0.000000  0.267104  0.511849  0.267104  0.0  1.0  0.0
3  0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085  0.000000  0.384085  0.0  1.0  0.0
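
Applied to the setup in the question, the same idea might look roughly like the sketch below. The small newMerged frame and its labels are made up for illustration, since the real data is not shown. The key point is that the vectorizer and the LabelBinarizer are fitted on the training rows only and then reused with transform on the test rows, so both matrices end up with the same columns. Note that with exactly two categories LabelBinarizer produces a single 0/1 column rather than one column per category.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer

# Hypothetical stand-in for newMerged; values are invented for illustration.
newMerged = pd.DataFrame({
    "CleanDesc": [
        "heater not working in room",
        "replace broken heater unit",
        "water leak under kitchen sink",
        "pipe leak in bathroom",
        "heater making loud noise",
        "sink drain is blocked",
    ],
    "type": ["HT", "HT", "PL", "PL", "HT", "PL"],
    "output": ["heating", "heating", "plumbing",
               "plumbing", "heating", "plumbing"],
})

# Simple positional split: first four rows for training, the rest for testing.
train = newMerged.iloc[:4]
test = newMerged.iloc[4:]

vectorizer = TfidfVectorizer()
lb = LabelBinarizer()

# Fit both encoders on the training rows, then stack their outputs side by side.
X_train = hstack((
    vectorizer.fit_transform(train["CleanDesc"]),
    lb.fit_transform(train["type"]),
)).tocsr()

clf = MultinomialNB().fit(X_train, train["output"])

# Reuse the fitted encoders (transform only) so the test columns line up.
X_test = hstack((
    vectorizer.transform(test["CleanDesc"]),
    lb.transform(test["type"]),
)).tocsr()

predicted = clf.predict(X_test)
print(predicted)
```

Whether the extra column actually improves accuracy depends on how informative type is for your labels, so it is worth comparing scores with and without it on held-out data.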
