Reputation: 4110
I have two files, each containing a bunch of sentences. One file has sentences with positive words, while the other has sentences with negative words. I am trying to train a classifier on two classes, "positive" and "negative", so that when I give it a new sentence it will tell me which category it belongs to. This is what I have so far:
...
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)
input_list = ['A positive sentence', 'A negative sentence', ...]
class_list = [0, 1, ...]
df= pd.DataFrame({'text':input_list,'class': class_list})
X = tfidf_vect.fit_transform(df['text'].values)
y = df['class'].values
a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train.toarray(), b_train)
prediction = classifier.predict(a_test.toarray())
from sklearn import metrics
print(metrics.f1_score(b_test, prediction, average='macro'))
# classify a new sentence
df= pd.DataFrame({'text': ['A negative sentence', 'A positive sentence'],'class': [1, 0]})
print(classifier.predict(tfidf_vect.transform(df['text'].values)))
When I try to classify new sentences I get a feature mismatch error. My question is: what exactly are the features being considered in this code? And how can I define the features myself (e.g. if I wanted to go beyond a bag of words and have the vector for each word encode something more)?
Upvotes: 0
Views: 935
Reputation: 5355
There's a nice function built into TfidfVectorizer
to help with that. Using your example below, you can see which words the features correspond to.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False)
input_list = ['A positive sentence', 'A negative sentence', ]
class_list = [0, 1]
df= pd.DataFrame({'text':input_list,'class': class_list})
X = tfidf_vect.fit_transform(df['text'].values)
y = df['class'].values
print(tfidf_vect.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.2
print()
print(X.todense())
Output:
[u'negative', u'positive', u'sentence']
[[ 0. 0.81480247 0.57973867]
[ 0.81480247 0. 0.57973867]]
If you want to extend your model to include pairs of words you can do this easily too:
tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True,
sublinear_tf=False, ngram_range=(1, 2))
input_list = ['A positive sentence', 'A negative sentence', ]
class_list = [0, 1]
df= pd.DataFrame({'text':input_list,'class': class_list})
X = tfidf_vect.fit_transform(df['text'].values)
y = df['class'].values
print(tfidf_vect.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.2
print()
print(X.todense())
Output:
[u'negative', u'negative sentence', u'positive', u'positive sentence', u'sentence']
[[ 0. 0. 0.6316672 0.6316672 0.44943642]
[ 0.6316672 0.6316672 0. 0. 0.44943642]]
If you want to add more custom features then you can do this by bolting them on the end, like this:
import numpy as np

X = np.array(X.todense())
my_feature = np.array([[0.7, 1.2]])
np.concatenate((X, my_feature.T), axis=1)
Output:
array([[ 0. , 0. , 0.6316672 , 0.6316672 , 0.44943642,
0.7 ],
[ 0.6316672 , 0.6316672 , 0. , 0. , 0.44943642,
1.2 ]])
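If you'd rather not densify the TF-IDF matrix (which can get large on real corpora), the same bolt-on can be done sparsely with scipy.sparse.hstack. A sketch, with the extra column values chosen arbitrarily as in the example above:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(1, 2))
X = tfidf_vect.fit_transform(['A positive sentence', 'A negative sentence'])

# One custom feature per document, stacked on as an extra sparse column
my_feature = csr_matrix(np.array([[0.7], [1.2]]))
X_extended = hstack([X, my_feature]).tocsr()

print(X_extended.shape)  # (2, 6): the 5 n-gram features plus 1 custom column
```

The resulting sparse matrix can be fed straight to most scikit-learn classifiers without ever calling `.todense()`.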
Upvotes: 1