mel

Reputation: 161

sklearn logistic regression - important features

I'm pretty sure this has been asked before, but I'm unable to find an answer.

Running a logistic regression with sklearn in Python, I'm able to reduce my dataset to its most important features using the transform method:

classf = linear_model.LogisticRegression()
func  = classf.fit(Xtrain, ytrain)
reduced_train = func.transform(Xtrain)

How can I tell which features were selected as most important? More generally, how can I calculate the p-value of each feature in the dataset?

Upvotes: 16

Views: 45814

Answers (3)

Keith

Reputation: 4924

As suggested in the comments above, you can (and should) scale your data prior to the fit, which makes the coefficients comparable. Below is a little code to show how this would work. I use this format so the relative importances are easy to compare.

import numpy as np    
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

x1 = np.random.randn(100)
x2 = np.random.randn(100)
x3 = np.random.randn(100)

#Make a difference in feature dependence (per-sample noise, not a single scalar)
y = (3 + x1 + 2*x2 + 5*x3 + 0.2*np.random.randn(100)) > 0

X = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3})

#Scale your data
scaler = StandardScaler()
scaler.fit(X) 
X_scaled = pd.DataFrame(scaler.transform(X),columns = X.columns)

clf = LogisticRegression(random_state = 0)
clf.fit(X_scaled, y)

feature_importance = abs(clf.coef_[0])
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

featfig = plt.figure()
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance[sorted_idx], align='center')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X.columns)[sorted_idx], fontsize=8)
featax.set_xlabel('Relative Feature Importance')

plt.tight_layout()   
plt.show()

Upvotes: 15

Fred Foo

Reputation: 363838

LogisticRegression.transform takes a threshold value that determines which features to keep. Straight from the docstring:

threshold : string, float or None, optional (default=None)
    The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute threshold is used. Otherwise, "mean" is used by default.

There is no threshold attribute on LR estimators, so by default only those features whose absolute coefficient value (after summing over the classes) is at least the mean are kept.
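Note that LogisticRegression.transform was removed in later scikit-learn versions; SelectFromModel gives the same behavior. A minimal sketch, assuming Xtrain and ytrain are defined as in the question:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(Xtrain, ytrain)

# Keep features whose absolute coefficient is >= the mean,
# mirroring the old default behavior of transform
selector = SelectFromModel(clf, threshold="mean", prefit=True)
reduced_train = selector.transform(Xtrain)

# Boolean mask of which columns were kept
kept = selector.get_support()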

Upvotes: 4

BrenBarn

Reputation: 251608

You can look at the coefficients in the coef_ attribute of the fitted model to see which features are most important. (For LogisticRegression, all transform is doing is looking at which coefficients are highest in absolute value.)
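As a rough sketch, assuming a fitted model named clf and a pandas DataFrame X of features (as in the scaled example above), you could rank the features yourself:

import numpy as np

# Sort features by the absolute value of their coefficients, largest first
order = np.argsort(np.abs(clf.coef_[0]))[::-1]
for i in order:
    print(X.columns[i], clf.coef_[0][i])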

Most scikit-learn models do not provide a way to calculate p-values. Broadly speaking, these models are designed to be used to actually predict outputs, not to be inspected to glean understanding about how the prediction is done. If you're interested in p-values you could take a look at statsmodels, although it is somewhat less mature than sklearn.
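For example, a minimal sketch with statsmodels, assuming X is a DataFrame of features and y is the binary target:

import statsmodels.api as sm

# statsmodels does not add an intercept automatically
X_const = sm.add_constant(X)
result = sm.Logit(y, X_const).fit()

# p-values for the intercept and each feature
print(result.pvalues)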

Upvotes: 4
