Reputation: 4695
Given the following data:
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
import io
df = pd.read_csv(
io.StringIO(
"noise_0,x0,x1,y\n1.0322600657764203,10.354468012163927,7.655143584899129,168.06121374114608\n4.478935261759052,8.786243147880384,6.244283164157256,156.570749155167\n9.085955030930956,10.450548129254543,8.084427493431185,152.10261405911672\n2.9361414837367947,10.869778308219216,9.165630427431644,129.72126680171317\n2.877753385863487,11.236593954599316,5.7987616455741575,55.294961794556315\n1.3002857211827767,9.111226379916955,10.289447419679227,308.7475968288771\n0.19366957870297075,9.753313270715008,9.803181441185592,163.337342478704\n6.788355329398909,9.752270042969856,9.004988677803736,271.9442757290742\n2.1162811600005904,8.67161845864426,9.801711898528824,158.09622149503954\n2.655466593722262,8.830913103331573,6.632544281651334,316.23912914041557\n"
)
)
which looks like this:
noise_0 x0 x1 y
0 1.032260 10.354468 7.655144 168.061214
1 4.478935 8.786243 6.244283 156.570749
2 9.085955 10.450548 8.084427 152.102614
3 2.936141 10.869778 9.165630 129.721267
4 2.877753 11.236594 5.798762 55.294962
5 1.300286 9.111226 10.289447 308.747597
6 0.193670 9.753313 9.803181 163.337342
7 6.788355 9.752270 9.004989 271.944276
8 2.116281 8.671618 9.801712 158.096221
9 2.655467 8.830913 6.632544 316.239129
and has correlation matrix
| | noise_0 | x0 | x1 | y |
|:--------|----------:|----------:|----------:|----------:|
| noise_0 | 1 | 0.159642 | -0.208966 | -0.02006 |
| x0 | 0.159642 | 1 | -0.197431 | -0.620964 |
| x1 | -0.208966 | -0.197431 | 1 | 0.304241 |
| y | -0.02006 | -0.620964 | 0.304241 | 1 |
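For reference, the matrix above can be reproduced directly from the DataFrame (a minimal sketch using pandas' default Pearson correlation):
# pairwise Pearson correlations between every column, including y
print(df.corr())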
I'm interested in how I can find the variable names x0, x1 via sklearn's feature selection.
When I try the following:
X_new = SelectKBest(f_regression, k=2).fit(df.drop("y", axis=1), df["y"])
I expect it to select x0 and x1, but I'm not sure how to determine which features were actually selected.
Upvotes: 3
Views: 1663
Reputation: 101
SelectKBest provides a get_support() method that can show you which features were selected. Rearrange the code to save the SelectKBest instance:
selector = SelectKBest(f_regression, k=2)
X = df.drop("y", axis=1)
X_new = selector.fit_transform(X, df["y"])  # fit the selector; X_new keeps only the k best columns
Now, running selector.get_support() will give us:
[False, True, True]
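To make the mask easier to read, it can be lined up with the columns of X (a small illustrative addition, not part of the original answer):
# pair each column of X with its selected / not-selected flag
for name, keep in zip(X.columns, selector.get_support()):
    print(name, keep)
# noise_0 False
# x0 True
# x1 True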
We can then use selector.get_support() to mask the columns of X:
X.columns.values[selector.get_support()]
for a final output of:
['x0', 'x1']
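As a side note, scikit-learn 1.0 and newer can report the selected names directly from the fitted selector, without the manual mask (this assumes the selector was fit on a DataFrame so the column names were recorded):
# requires scikit-learn >= 1.0; names come from the DataFrame passed to fit
print(selector.get_feature_names_out())  # ['x0' 'x1']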
Upvotes: 4