Reputation: 1511
I want to run a logistic regression on a data set with 400 features/columns (x_vals) and one label column (labels).
I made a training and testing data set like this:
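from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

X_train, X_test, y_train, y_test = train_test_split(x_vals, labels, test_size=0.2)
# Then I identified which of my X_train columns are most predictive of my labels with this: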
logreg = LogisticRegression()
rfe = RFE(logreg)
rfe = rfe.fit(X_train, y_train)
print(rfe.support_)
print(rfe.ranking_)
print(rfe.support_[0])
The output from rfe.support_ is a boolean array with one True/False entry per column, indicating whether that column should be kept in the data set, e.g.
[ True False False True False True False True False False False True
True True True True False False True False False True False True
True True False True True True True True False True True False
(in reality it's length 400)
All I want to do is work out the most efficient way to keep only the columns of X_train that are marked True in rfe.support_ (i.e. keep only the predictive features).
I can think of a slow way of doing it:
for each_col in dataframe:
    get the index of each_col
    get the value at that index in rfe.support_
    if the value is False:
        remove each_col from dataframe
I feel like this is cumbersome, so before I start, I'm wondering whether there is a more Pythonic way to do it.
Upvotes: 1
Views: 873
Reputation: 320
You could achieve this with DataFrame.drop, passing the names of the columns you wish to delete. Here is the code for an example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
>>> keep_list = [True, False, False, True]
# Get the names of the columns which are False - i.e. columns to delete
>>> columns_to_remove = df.columns.values[np.logical_not(keep_list)]
>>> columns_to_remove
array(['B', 'C'], dtype=object)
# Returns our dataframe without columns 'B' and 'C'
>>> df.drop(columns=columns_to_remove)
   A   D
0  0   3
1  4   7
2  8  11
Explanation
We want to delete the columns which are not to be kept, so we flip the logic using np.logical_not(keep_list). We can then index the column names we wish to delete with df.columns.values[np.logical_not(keep_list)] and delete these columns using df.drop(columns=columns_to_remove).
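As an alternative sketch, the boolean mask can also be used to select the columns to keep directly with .loc, which skips the step of building a list of names to drop. The X_train and support names below are hypothetical stand-ins (a small toy frame and a hand-written mask in place of rfe.support_); with the real data the mask would presumably be rfe.support_ itself, and the same mask could be applied to X_test.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train: 3 rows, 4 feature columns.
X_train = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])

# Hypothetical stand-in for rfe.support_: one boolean per column.
support = np.array([True, False, False, True])

# Keep only the columns whose mask entry is True.
X_train_selected = X_train.loc[:, support]

# The same result via the drop-based approach, for comparison.
X_train_dropped = X_train.drop(columns=X_train.columns[~support])

print(list(X_train_selected.columns))
print(X_train_selected.equals(X_train_dropped))
```

Both approaches leave the original frame untouched and return a new one, so the result needs to be assigned to a variable if it is to be used later.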
Upvotes: 1