Reputation: 161
I'm pretty sure this has been asked before, but I'm unable to find an answer.
Running logistic regression with sklearn in Python, I'm able to transform my dataset to its most important features using the transform method:
from sklearn import linear_model

classf = linear_model.LogisticRegression()
func = classf.fit(Xtrain, ytrain)
reduced_train = func.transform(Xtrain)
How can I tell which features were selected as most important? More generally, how can I calculate the p-value of each feature in the dataset?
Upvotes: 16
Views: 45814
Reputation: 4924
As suggested in the comments above, you can (and should) scale your data prior to fitting, which makes the coefficients comparable. Below is a little code to show how this would work; the plot follows a common feature-importance bar-chart format for comparison.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)  # reproducible toy data
x1 = np.random.randn(100)
x2 = np.random.randn(100)
x3 = np.random.randn(100)
# Make the features differ in how strongly they drive y
y = (3 + x1 + 2*x2 + 5*x3 + 0.2*np.random.randn(100)) > 0
X = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3})
#Scale your data
scaler = StandardScaler()
scaler.fit(X)
X_scaled = pd.DataFrame(scaler.transform(X),columns = X.columns)
clf = LogisticRegression(random_state = 0)
clf.fit(X_scaled, y)
# Relative importance: absolute coefficient size as a percentage of the largest
feature_importance = abs(clf.coef_[0])
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
featfig = plt.figure()
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance[sorted_idx], align='center')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X.columns)[sorted_idx], fontsize=8)
featax.set_xlabel('Relative Feature Importance')
plt.tight_layout()
plt.show()
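If you want the actual numbers rather than a plot, a quick follow-on sketch using the same clf and X_scaled as above:

# pair each feature with its (scaled) coefficient, largest magnitude first
coefs = pd.Series(clf.coef_[0], index=X_scaled.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))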
Upvotes: 15
Reputation: 363838
LogisticRegression.transform takes a threshold value that determines which features to keep. Straight from the docstring:
threshold : string, float or None, optional (default=None)
    The threshold value to use for feature selection. Features whose
    importance is greater or equal are kept while the others are
    discarded. If "median" (resp. "mean"), then the threshold value is
    the median (resp. the mean) of the feature importances. A scaling
    factor (e.g., "1.25*mean") may also be used. If None and if
    available, the object attribute ``threshold`` is used. Otherwise,
    "mean" is used by default.
There is no object attribute threshold on LR estimators, so by default only the features whose absolute coefficient value (summed over the classes) is greater than the mean are kept.
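Note that transform has since been removed from estimators in newer scikit-learn releases; the same thresholding now lives in SelectFromModel. A minimal sketch on toy data (the names here are illustrative, not from the question):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
# prefit=True reuses the fitted estimator; threshold="mean" reproduces
# the old default described in the docstring above
selector = SelectFromModel(clf, threshold="mean", prefit=True)
X_reduced = selector.transform(X)
print(selector.get_support())  # boolean mask of the kept features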
Upvotes: 4
Reputation: 251608
You can look at the coefficients in the coef_ attribute of the fitted model to see which features are most important. (For LogisticRegression, all transform is doing is looking at which coefficients are highest in absolute value.)
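For instance, a quick sketch reusing classf and Xtrain from the question (assuming Xtrain is a pandas DataFrame):

import numpy as np

# rank features by absolute coefficient size, summed over classes --
# the same quantity transform thresholds on
importance = np.abs(classf.coef_).sum(axis=0)
for idx in np.argsort(importance)[::-1]:  # most important first
    print(Xtrain.columns[idx], importance[idx])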
Most scikit-learn models do not provide a way to calculate p-values. Broadly speaking, these models are designed to be used to actually predict outputs, not to be inspected to glean understanding about how the prediction is done. If you're interested in p-values, you could take a look at statsmodels, although it is somewhat less mature than sklearn.
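A minimal sketch of the statsmodels route, on hypothetical toy data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = (X @ np.array([1.0, 2.0, 5.0]) + rng.standard_normal(100) > 0).astype(int)

X_const = sm.add_constant(X)  # statsmodels does not add an intercept automatically
result = sm.Logit(y, X_const).fit(disp=0)
print(result.pvalues)  # one p-value per column, intercept first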
Upvotes: 4