Reputation: 510
I use this function to plot the best and worst features (coef) for each label.
def plot_coefficients(classifier, feature_names, top_features=20):
    coef = classifier.coef_.ravel()
    for i in np.split(coef, 6):
        top_positive_coefficients = np.argsort(i)[-top_features:]
        top_negative_coefficients = np.argsort(i)[:top_features]
        top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    # create plot
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
    plt.bar(np.arange(2 * top_features), i[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, 1 + 2 * top_features), feature_names[top_coefficients], rotation=60, ha="right")
    plt.show()
Applying it to sklearn.LinearSVC:
if (name == "LinearSVC"):
    print(clf.coef_)
    print(clf.intercept_)
    plot_coefficients(clf, cv.get_feature_names())
The CountVectorizer used has a dimension of (15258, 26728). It's a multi-class decision problem with 6 labels. Using .ravel() returns a flat array with a length of 6*26728 = 160368, meaning that all indices higher than 26728 are out of bounds for axis 1. Here are the top and bottom indices for one label:
i[ 0. 0. 0.07465654 ... -0.02112607 0. -0.13656274]
Top [39336 35593 29445 29715 36418 28631 28332 40843 34760 35887 48455 27753
33291 54136 36067 33961 34644 38816 36407 35781]
i[ 0. 0. 0.07465654 ... -0.02112607 0. -0.13656274]
Bot [39397 40215 34521 39392 34586 32206 36526 42766 48373 31783 35404 30296
33165 29964 50325 53620 34805 32596 34807 40895]
The first entry in the "top" list has the index 39336, which corresponds to entry 39336 - 26728 = 12608 in the vocabulary. What would I need to change in the code to make this work?
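For reference, here's the index arithmetic as a toy check (assuming the flat array concatenates the 6 label rows of 26728 coefficients each):

import numpy as np

n_features = 26728  # vocabulary size from the CountVectorizer
flat_index = 39336  # first entry in the "top" list above
# recover (label row, vocabulary column) from an index into the flat array
row, col = divmod(flat_index, n_features)
print(row, col)  # 1 12608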
EDIT:
X_train = sparse.hstack([training_sentences,entities1train,predictionstraining_entity1,entities2train,predictionstraining_entity2,graphpath_training,graphpathlength_training])
y_train = DFTrain["R"]
X_test = sparse.hstack([testing_sentences,entities1test,predictionstest_entity1,entities2test,predictionstest_entity2,graphpath_testing,graphpathlength_testing])
y_test = DFTest["R"]
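As a toy illustration (not my actual data), sparse.hstack concatenates the blocks column-wise, so the row counts must match and the columns add up:

from scipy import sparse
import numpy as np

a = sparse.csr_matrix(np.ones((4, 3)))  # e.g. a vectorized text block
b = sparse.csr_matrix(np.ones((4, 1)))  # e.g. a single extra feature column
stacked = sparse.hstack([a, b])
print(stacked.shape)  # (4, 4): same row count, 3 + 1 columns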
Dimensions:
(15258, 26728)
(15258, 26728)
(0, 0) 1
...
(15257, 0) 1
(15258, 26728)
(0, 0) 1
...
(15257, 0) 1
(15258, 26728)
(15258L, 1L)
File "TwoFeat.py", line 708, in plot_coefficients
colors = ["red" if c < 0 else "blue" for c in i[top_coefficients]]
MemoryError
Upvotes: 4
Views: 1020
Reputation: 36619
First, is it necessary to use ravel()?
LinearSVC (or in fact any other classifier which has coef_) gives out coef_ in the shape:

coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes, n_features]
    Weights assigned to the features (coefficients in the primal problem).

So this has a number of rows equal to the number of classes and a number of columns equal to the number of features. For each class, you just need to access the right row. The order of classes is available from the classifier.classes_ attribute.
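A minimal sketch of that row/class correspondence (toy data, the names are illustrative only):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(30, 5)                 # 30 samples, 5 features
y = np.array(["a", "b", "c"] * 10)  # 3 classes

clf = LinearSVC(random_state=0).fit(X, y)
print(clf.coef_.shape)  # (3, 5): one row per class
print(clf.classes_)     # ['a' 'b' 'c']; coef_[i] belongs to classes_[i]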
Secondly, the indentation of your code is wrong. The plotting code should be inside the for loop, so that a plot is drawn for each class; currently it's outside the scope of the for loop, so it only plots for the last class.
Correcting these two things, here's sample reproducible code to plot the top and bottom features for each class:
def plot_coefficients(classifier, feature_names, top_features=20):
    # Access the coefficients from classifier
    coef = classifier.coef_
    # Access the classes
    classes = classifier.classes_
    # Iterate the loop for number of classes
    for i in range(len(classes)):
        print(classes[i])
        # Access the row containing the coefficients for this class
        class_coef = coef[i]
        # Below this, I have just replaced 'i' in your code with 'class_coef'
        # Pass this to get top and bottom features
        top_positive_coefficients = np.argsort(class_coef)[-top_features:]
        top_negative_coefficients = np.argsort(class_coef)[:top_features]
        # Concatenate the above two
        top_coefficients = np.hstack([top_negative_coefficients,
                                      top_positive_coefficients])
        # create plot
        plt.figure(figsize=(10, 3))
        colors = ["red" if c < 0 else "blue" for c in class_coef[top_coefficients]]
        plt.bar(np.arange(2 * top_features), class_coef[top_coefficients], color=colors)
        feature_names = np.array(feature_names)
        # Here I corrected the start to 0 (your code had 1, which shifted the labels);
        # the tick positions must match the 2 * top_features bar positions
        plt.xticks(np.arange(2 * top_features),
                   feature_names[top_coefficients], rotation=60, ha="right")
        plt.show()
Now just use this method as you like:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space']
dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)
vectorizer = CountVectorizer()

# Just to replace classes from integers to their actual labels,
# you can use anything as you like in y
y = []
mapping_dict = dict(enumerate(dataset.target_names))
for i in dataset.target:
    y.append(mapping_dict[i])

# Learn the words from data
X = vectorizer.fit_transform(dataset.data)

clf = LinearSVC(random_state=42)
clf.fit(X, y)

plot_coefficients(clf, vectorizer.get_feature_names())
Output from above code: one bar chart per class, with the most negative coefficients in red and the most positive in blue.
Upvotes: 2