Reputation: 3241
import numpy as np
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
I have 3 labels (male, female, na), denoted as follows:
labels = [0,1,2]
Each label is described by 3 features (height, weight, and age) in the training data:
Training data for males:
male_height = np.array([111,121,137,143,157])
male_weight = np.array([60,70,88,99,75])
male_age = np.array([41,32,73,54,35])
males = np.vstack([male_height,male_weight,male_age]).T
Training data for females:
female_height = np.array([91,121,135,98,90])
female_weight = np.array([32,67,98,86,56])
female_age = np.array([51,35,33,67,61])
females = np.vstack([female_height,female_weight,female_age]).T
Training data for the 'not available' (na) label:
na_height = np.array([96,127,145,99,91])
na_weight = np.array([42,97,78,76,86])
na_age = np.array([56,35,49,64,66])
nas = np.vstack([na_height,na_weight,na_age]).T
So, the complete training data are:
trainingData = np.vstack([males,females,nas])
Complete labels are:
labels = np.repeat(labels,5)
Now, I want to select the best features, output their names, and apply only those best features for fitting the support vector machine model.
I tried the following, based on the answer from @eickenberg and the comments from @larsmans:
keep = 2  # number of best features to keep
selector = SelectKBest(f_classif, k=keep)
clf = make_pipeline(selector, StandardScaler(), svm.SVC())
clf.fit(trainingData, labels)
selected = trainingData[selector.get_support()]
print(selected)
[[111 60 41]
[121 70 32]]
However, all the selected elements belong to the label 'male' and still contain all three features (height, weight, and age). I cannot figure out where I am messing up. Could someone point me in the right direction?
Upvotes: 0
Views: 2253
Reputation: 371
To be honest, I have only used the Support Vector Machine model for text classification (an entirely different problem). But from that experience I can confidently say that the more features you have, the better your predictions will be.
To summarize, do not filter out features: the Support Vector Machine will make use of every feature, however small its importance.
But if feature selection is a real necessity, look into scikit-learn's RandomForestClassifier. It can accurately assess which features are most important through its feature_importances_ attribute.
Here's an example of how I would use it (code not tested):
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()  # tweak the parameters yourself
clf.fit(X, Y)  # if you're passing in a sparse matrix, apply .toarray() to X first
print(clf.feature_importances_)
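If you want to tie those importances back to your feature names, here is a minimal sketch applied to the question's data (assuming trainingData and labels as defined above, with columns in the order height, weight, age):
from sklearn.ensemble import RandomForestClassifier
import numpy as np

feature_names = np.array(['height', 'weight', 'age'])  # column order of trainingData

forest = RandomForestClassifier()
forest.fit(trainingData, labels)

# print each feature name alongside its importance score
for name, importance in zip(feature_names, forest.feature_importances_):
    print(name, importance)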
Hope that helps.
Upvotes: 2
Reputation: 14377
You can use e.g. SelectKBest as follows:
from sklearn.feature_selection import SelectKBest, f_classif
keep = 2
selector = SelectKBest(f_classif, k=keep)
and place it into your pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(selector, StandardScaler(), svm.SVC())
pipe.fit(trainingData, labels)
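Since you also want the names of the selected features, here is one way to get them after fitting (a sketch, assuming the feature order height, weight, age from the question). Note that get_support() returns a mask over the feature columns, so you index columns rather than rows:
import numpy as np

feature_names = np.array(['height', 'weight', 'age'])  # column order of trainingData

mask = selector.get_support()       # boolean mask over the feature columns
print(feature_names[mask])          # names of the kept features
print(trainingData[:, mask])        # training data restricted to the kept columns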
Upvotes: 3