Andrei V.

Reputation: 13

Getting scores that are too high when using SelectPercentile (from sklearn) and SVM as the classifier

In Python, I applied SelectPercentile (from sklearn) to keep only the most relevant features and then trained an SVM classifier. I want to mention that I have only one corpus, so I have to perform cross-validation on it.
After selecting features with SelectPercentile, cross-validation gives me scores that seem too high, and I think I am doing something wrong, but I can't figure out what. I thought the X_all matrix might have duplicate rows or columns, but it doesn't.

I don't understand why I get these results. Can anyone explain what's happening under the hood and what I am doing wrong?

Implementation

# extract only words from the dataset
# create dataframe using Pandas

The dataframe has the following structure (a toy example is sketched below):
- data: contains only words, with stop-words removed
- gender: 1 or 0
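
For illustration only, a toy dataframe with that structure might look like this (the words and labels are invented; the real corpus has 150 samples):

import pandas as pd

# made-up example rows, just to show the expected columns
dataframe = pd.DataFrame({
    'data':   ['love football match beer', 'shopping dresses weekend'],
    'gender': [0, 1],
})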

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn import svm
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

# TF-IDF matrix over the whole corpus
vectorizer = TfidfVectorizer(lowercase=False, min_df=1)
X_all = vectorizer.fit_transform(dataframe.data)
y_all = dataframe.gender

# univariate feature selection, fitted on ALL samples
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X_all, y_all)
X_all = selector.transform(X_all)

classifier = svm.SVC()

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

gs = GridSearchCV(classifier, param_grid, cv=5, n_jobs=4)
gs.fit(X_all.toarray(), y_all)
sorted(gs.grid_scores_, key=lambda x: x.mean_validation_score, reverse=True)

print(gs.best_score_)
print(gs.best_params_)

Scores obtained using all 150 samples:
Without SelectPercentile: 0.756 (9704 features)

with percentile=90:    0.822    (8733 features)
with percentile=70:    0.947    (6792 features)
with percentile=50:    0.973    (4852 features)
with percentile=30:    0.967    (2911 features)
with percentile=10:    0.970    ( 971 features)
with percentile=3 :    0.910    ( 292 features)
with percentile=1 :    0.820    ( 98 features)

On the other hand, I tried another approach and split my 150 samples into train and test sets as follows:

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases

features_train, features_test, target_train, target_test = train_test_split(
    X_all, y_all, test_size=0.20, random_state=0)

# fit the selector on the training split only
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(features_train, target_train)

features_train = selector.transform(features_train).toarray()
features_test = selector.transform(features_test).toarray()

classifier = svm.SVC().fit(features_train, target_train)
# score on the held-out test split
print("Test score: {0:.1f}%".format(classifier.score(features_test, target_test) * 100))

Using this approach, I get a warning:

"/usr/local/lib/python2.7/dist-packages/sklearn/feature_selection/univariate_selection.py:113: UserWarning: Features [0 0 0 ..., 0 0 0] are constant. UserWarning)"

And the result is constant at 44.3%, whatever the percentile is (10, 30, 50, ..., 99).

Upvotes: 1

Views: 854

Answers (1)

Zichen Wang

Reputation: 1374

I think you should not perform feature selection (SelectPercentile) on all of the data (X_all). By doing so, the data you hold out for testing during cross-validation 'leaks' into your model: the feature selection has already seen the test samples and hands your classifier a subset of features that is correlated with the labels in both the training and test sets.

You should use a Pipeline to chain the feature selection with your classifier and run the cross-validation on the whole pipeline for model evaluation.
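
A minimal sketch of what that could look like, reusing X_all, y_all and the parameter grid from the question (the imports assume a recent scikit-learn; in older releases GridSearchCV lives in sklearn.grid_search):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases

# the selector sits inside the pipeline, so during cross-validation f_classif
# is fitted on each training fold only and the held-out fold stays unseen
pipe = Pipeline([
    ('select', SelectPercentile(f_classif, percentile=10)),
    ('svm', SVC()),
])

# parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = [
    {'svm__C': [1, 10, 100, 1000], 'svm__kernel': ['linear']},
    {'svm__C': [1, 10, 100, 1000], 'svm__gamma': [0.001, 0.0001], 'svm__kernel': ['rbf']},
]

gs = GridSearchCV(pipe, param_grid, cv=5, n_jobs=4)
gs.fit(X_all, y_all)  # X_all straight from TfidfVectorizer, before any selection
print(gs.best_score_)
print(gs.best_params_)

If you also want to tune the percentile, you can add 'select__percentile': [1, 3, 10, 30, 50] to the grid and let the search pick it per fold.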

That said, I think your approach of univariate feature selection followed by an SVM is likely to be out-performed by an SVD-SVM pipeline for a text classification problem. Check out this answer for an example script.
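
For comparison, a rough sketch of such an SVD-SVM pipeline (TruncatedSVD as the SVD step on the TF-IDF matrix; the component count and SVM settings are arbitrary choices, not tuned values):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases

# reduce the sparse TF-IDF matrix to a small number of dense LSA components,
# then classify with an SVM; both steps are refit on each cross-validation fold
svd_svm = Pipeline([
    ('svd', TruncatedSVD(n_components=50)),
    ('svm', SVC(kernel='linear', C=1)),
])

scores = cross_val_score(svd_svm, X_all, y_all, cv=5)
print(scores.mean())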

Upvotes: 1
