Roman

Reputation: 3241

Feature Selection for Supervised Learning

import numpy as np
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_classif

I have 3 labels (male, female, na), denoted as follows:

labels = [0,1,2]

Each label is described by 3 features (height, weight, and age), which serve as the training data:

Training data for males:

male_height = np.array([111,121,137,143,157])
male_weight = np.array([60,70,88,99,75])
male_age = np.array([41,32,73,54,35])

males = np.vstack([male_height,male_weight,male_age]).T

Training data for females:

female_height = np.array([91,121,135,98,90])
female_weight = np.array([32,67,98,86,56])
female_age = np.array([51,35,33,67,61])

females = np.vstack([female_height,female_weight,female_age]).T

Training data for not availables:

na_height = np.array([96,127,145,99,91])
na_weight = np.array([42,97,78,76,86])
na_age = np.array([56,35,49,64,66])

nas = np.vstack([na_height,na_weight,na_age]).T

So, the complete training data are:

trainingData = np.vstack([males,females,nas])

Complete labels are:

labels =  np.repeat(labels,5)

Now, I want to select the best features, output their names, and apply only those best features for fitting the support vector machine model.

I tried the following, based on the answer from @eickenberg and the comments from @larsmans:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

keep = 2  # as in @eickenberg's answer below
selector = SelectKBest(f_classif, k=keep)
clf = make_pipeline(selector, StandardScaler(), svm.SVC())
clf.fit(trainingData, labels)

selected = trainingData[selector.get_support()]

print(selected)

[[111 60 41]
 [121 70 32]]

However, all the selected elements belong to the label 'male', and they still contain all three features (height, weight, and age). I cannot figure out where I am going wrong. Could someone point me in the right direction?
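
For reference, here is what the selector reports when I inspect it directly (these extra prints are just my own diagnostics on top of the attempt above):

print(selector.get_support())    # boolean mask over the three feature columns
print(trainingData.shape)        # (15, 3): 15 samples, 3 features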

Upvotes: 0

Views: 2253

Answers (2)

Jin Lee

Reputation: 371

To be honest, I have only used the Support Vector Machine model for text classification (an entirely different problem). But from that experience, I can confidently say that the more features you have, the better your predictions will be.

To summarize, do not filter out features: the Support Vector Machine will make use of every feature, no matter how little importance it carries.

But if this is really necessary, look into scikit-learn's RandomForestClassifier. It can assess which features are more important, using the feature_importances_ attribute.

Here's an example of how I would use it (code not tested):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()  # tweak the parameters yourself
clf.fit(X, Y)  # if you're passing in a sparse matrix, apply .toarray() to X
print(clf.feature_importances_)
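
If you also want the feature names next to the importances, here is a rough, untested sketch along the same lines; feature_names and the use of trainingData/labels from your question are just illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = np.array(['height', 'weight', 'age'])  # names taken from the question

clf = RandomForestClassifier()
clf.fit(trainingData, labels)

# sort features from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
for name, score in zip(feature_names[order], clf.feature_importances_[order]):
    print("%s: %.3f" % (name, score))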

Hope that helps.

Upvotes: 2

eickenberg

Reputation: 14377

You can use e.g. SelectKBest as follows:

from sklearn.feature_selection import SelectKBest, f_classif
keep = 2
selector = SelectKBest(f_classif, k=keep)

and place it into your pipeline:

pipe = make_pipeline(selector, StandardScaler(), svm.SVC())

pipe.fit(trainingData, labels)
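
If you also need the names of the kept features, you can index a list of names with the fitted selector's support mask (the feature_names array below is just an illustrative helper, not something the pipeline provides):

import numpy as np

feature_names = np.array(['height', 'weight', 'age'])  # illustrative helper

mask = selector.get_support()            # boolean mask over the 3 feature columns
print(feature_names[mask])               # names of the k best features
print(selector.transform(trainingData))  # only the selected columns, for all samples
# equivalently: trainingData[:, mask] selects columns rather than rows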

Upvotes: 3
