Roman

Reputation: 3241

Feature Selection for Supervised Learning

import numpy as np
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_classif

I have 3 labels (male, female, na), denoted as follows:

labels = [0,1,2]

Each label is described by 3 features (height, weight, and age), which serve as the training data:

Training data for males:

male_height = np.array([111,121,137,143,157])
male_weight = np.array([60,70,88,99,75])
male_age = np.array([41,32,73,54,35])

males = np.vstack([male_height,male_weight,male_age]).T

Training data for females:

female_height = np.array([91,121,135,98,90])
female_weight = np.array([32,67,98,86,56])
female_age = np.array([51,35,33,67,61])

females = np.vstack([female_height,female_weight,female_age]).T

Training data for not availables:

na_height = np.array([96,127,145,99,91])
na_weight = np.array([42,97,78,76,86])
na_age = np.array([56,35,49,64,66])

nas = np.vstack([na_height,na_weight,na_age]).T

So, the complete training data are:

trainingData = np.vstack([males,females,nas])

Complete labels are:

labels =  np.repeat(labels,5)

Now, I want to select the best features, output their names, and apply only those best features for fitting the support vector machine model.

I tried the following, based on the answer from @eickenberg and the comments from @larsmans:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

keep = 2  # as in @eickenberg's answer below
selector = SelectKBest(f_classif, k=keep)
clf = make_pipeline(selector, StandardScaler(), svm.SVC())
clf.fit(trainingData, labels)

selected = trainingData[selector.get_support()]

print(selected)

[[111 60 41]
 [121 70 32]]

However, all the selected elements belong to the label 'male', and they still contain all three features (height, weight, and age). I cannot figure out where I am going wrong. Could someone point me in the right direction?
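
For reference, here is what the selector reports when I inspect it directly (these extra prints are just my own diagnostics on top of the attempt above):

print(selector.get_support())    # boolean mask over the three feature columns
print(trainingData.shape)        # (15, 3): 15 samples, 3 features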

Upvotes: 0

Views: 2253

Answers (2)

Jin Lee

Reputation: 371

To be honest, I have only used the Support Vector Machine model for text classification (an entirely different problem). But from that experience, I can confidently say that the more features you have, the better your predictions will be.

To summarize, do not filter out features: the Support Vector Machine will make use of every feature, no matter how little importance it carries.

But if this is really necessary, look into scikit-learn's RandomForestClassifier. It can assess which features are more important, using the feature_importances_ attribute.

Here's an example of how I would use it (code not tested):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()  # tweak the parameters yourself
clf.fit(X, Y)  # if you're passing in a sparse matrix, apply .toarray() to X
print(clf.feature_importances_)
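
If you also want the feature names next to the importances, here is a rough, untested sketch along the same lines; feature_names and the use of trainingData/labels from your question are just illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = np.array(['height', 'weight', 'age'])  # names taken from the question

clf = RandomForestClassifier()
clf.fit(trainingData, labels)

# sort features from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
for name, score in zip(feature_names[order], clf.feature_importances_[order]):
    print("%s: %.3f" % (name, score))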

Hope that helps.

Upvotes: 2

eickenberg

Reputation: 14377

You can use e.g. SelectKBest as follows:

from sklearn.feature_selection import SelectKBest, f_classif
keep = 2
selector = SelectKBest(f_classif, k=keep)

and place it into your pipeline:

pipe = make_pipeline(selector, StandardScaler(), svm.SVC())

pipe.fit(trainingData, labels)
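
If you also need the names of the kept features, you can index a list of names with the fitted selector's support mask (the feature_names array below is just an illustrative helper, not something the pipeline provides):

import numpy as np

feature_names = np.array(['height', 'weight', 'age'])  # illustrative helper

mask = selector.get_support()            # boolean mask over the 3 feature columns
print(feature_names[mask])               # names of the k best features
print(selector.transform(trainingData))  # only the selected columns, for all samples
# equivalently: trainingData[:, mask] selects columns rather than rows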

Upvotes: 3
