Slowat_Kela

Reputation: 1511

Extract feature columns from training data set based on RFE output

I want to run a logistic regression on a data set with 400 features/columns (x_vals) and one label column (labels).

I made a training and testing data set like this:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_vals, labels, test_size=0.2)

I then identified which of my X_train columns are most predictive of my labels with this:

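from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
rfe = RFE(logreg)
rfe = rfe.fit(X_train, y_train)
print(rfe.support_)   # boolean mask, one entry per column
print(rfe.ranking_)   # selected features have rank 1

print(rfe.support_[0])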

The output from rfe.support_ is a boolean array of True/False values indicating whether each column should be kept in the data set, e.g.:

[ True False False  True False  True False  True False False False  True
  True  True  True  True False False  True False False  True False  True
  True  True False  True  True  True  True  True False  True  True False

(in reality it's length 400)

All I want to do is work out the most efficient way to keep only the columns of X_train that are marked True in rfe.support_ (i.e. keep only the predictive features).

I can think of a slow way of doing it:

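# check each column's position against rfe.support_ and drop the rejected ones
for idx, each_col in enumerate(list(X_train.columns)):
    if not rfe.support_[idx]:  # test the boolean itself, not the string 'False'
        X_train = X_train.drop(columns=each_col)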

I feel like this is cumbersome, and I'm wondering, before I start, whether there is a more Pythonic way to do it.

Upvotes: 1

Views: 873

Answers (1)

Athena

Reputation: 320

You could achieve this with DataFrame.drop, passing the names of the columns you wish to delete. Here is the code for an example:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
>>> keep_list = [True, False, False, True]

# Get the names of the columns which are False - i.e. columns to delete
>>> columns_to_remove = df.columns.values[np.logical_not(keep_list)]
>>> columns_to_remove
array(['B', 'C'], dtype=object)

# Returns our dataframe without columns 'B' and 'C'
>>> df.drop(columns=columns_to_remove)
   A   D
0  0   3
1  4   7
2  8  11

Explanation

We want to delete the columns that are not to be kept, so we flip the logic using np.logical_not(keep_list). We can then index the column names we wish to delete with df.columns.values[np.logical_not(keep_list)] and delete those columns using df.drop(columns=columns_to_remove). Note that df.drop returns a new DataFrame and leaves df unchanged, so assign the result if you want to keep it.
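As an alternative to dropping the rejected columns, the mask can also be used to select the kept columns directly. The following is only a minimal sketch: the toy X_train and the support array are stand-ins I made up for the real X_train and rfe.support_ from the question, and the names X_train_selected, kept_columns and X_train_by_name are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train; in the question this would have 400 columns.
X_train = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["A", "B", "C", "D"])

# Stand-in for rfe.support_: a boolean array with one entry per column.
support = np.array([True, False, False, True])

# Keep only the columns flagged True; this returns a new DataFrame.
X_train_selected = X_train.loc[:, support]

# Equivalent: look up the surviving column names first, then select by name.
kept_columns = X_train.columns[support]
X_train_by_name = X_train[kept_columns]

print(list(X_train_selected.columns))  # ['A', 'D']
```

The same mask (or kept_columns) can be applied to X_test so that both sets end up with identical columns. rfe.transform(X_train) performs the same selection, but by default it returns a NumPy array without the column names.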

Upvotes: 1
