Uasthana

Reputation: 1695

Prediction after feature selection in Python

I am trying to build a predictive model in Python. The training and test data sets have over 400 variables. After applying feature selection to the training set, the number of variables is reduced to 180:

from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=.9)
df = sel.fit_transform(X)  # fit on the raw training matrix X; df keeps the 180 surviving columns
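
If you want to confirm which of the original 400+ columns survive, the fitted selector exposes a boolean mask (a minimal sketch using the sel object above):

kept = sel.get_support()  # True marks a column the selector kept
print(kept.sum())         # 180 in this case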

and then I am training a model with the gradient boosting algorithm, achieving an AUC of 0.84 in cross-validation:

from sklearn import ensemble
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score as auc

# hold out 20% of the (reduced) training data for evaluation
df_fit, df_eval, y_fit, y_eval = train_test_split(df, y, test_size=0.2, random_state=1)

boosting_model = ensemble.GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                                     min_samples_leaf=100, learning_rate=0.1,
                                                     subsample=0.5, random_state=1)
boosting_model.fit(df_fit, y_fit)
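
For reference, the reported AUC can be checked on the held-out split with the roc_auc_score imported above (a minimal sketch; predict_proba supplies the positive-class scores that AUC expects):

eval_scores = boosting_model.predict_proba(df_eval)[:, 1]  # probability of the positive class
print(auc(y_eval, eval_scores))  # AUC on the 20% evaluation split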

But when I try to use this model to predict on the prediction data set, it gives me an error:

predict_target = boosting_model.predict(df_prediction)
Error: Number of variables in prediction data set 'df_prediction' does not match the number of variables in the model

That makes sense, because the prediction data set still has over 400 variables. My question: is there any way to get around this problem and keep using feature selection for predictive modeling? If I remove the feature selection step, the AUC of the model drops to 0.5, which is very poor. Thanks!

Upvotes: 4

Views: 1670

Answers (1)

lejlot

Reputation: 66775

You should run your prediction matrix through the same feature selection step, too. So somewhere in your code you do

df = sel.fit_transform(X)  # fit the selector on the training data and reduce it

and before predicting

df_prediction = sel.transform(X_prediction)  # reuse the fitted selector; same 180 columns
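
Alternatively, chaining the selector and the classifier in an sklearn Pipeline keeps the two steps in sync, so predict automatically re-applies the fitted selection. A minimal sketch, assuming X, y are your raw training data and X_prediction is the raw prediction matrix:

from sklearn.pipeline import Pipeline

# fit() fits the selector on X, then the classifier on the reduced matrix;
# predict() runs X_prediction through the same column selection before scoring
model = Pipeline([('select', sel), ('boost', boosting_model)])
model.fit(X, y)
predict_target = model.predict(X_prediction)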

Upvotes: 4
