Reputation: 1695
I am trying to build a predictive model in Python. The training and test data sets each have over 400 variables. Applying feature selection to the training data set reduces the number of variables to 180:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.9)
I then train a model with a gradient boosting algorithm, which achieves an AUC of 0.84 in cross-validation:
from sklearn import ensemble
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import roc_auc_score as auc
df_fit, df_eval, y_fit, y_eval = train_test_split(df, y, test_size=0.2, random_state=1)
boosting_model = ensemble.GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                                     min_samples_leaf=100, learning_rate=0.1,
                                                     subsample=0.5, random_state=1)
boosting_model.fit(df_fit, y_fit)
But when I try to use this model to predict on the prediction data set, it gives me an error:
predict_target = boosting_model.predict(df_prediction)
Error: Number of variables in prediction data set 'df_prediction' does not match the number of variables in the model
This makes sense, because the prediction data set still has over 400 variables. My question: is there any way to get around this problem and keep using feature selection for predictive modeling? If I remove the feature selection, the AUC of the model drops to 0.5, which is very poor. Thanks!
Upvotes: 4
Views: 1670
Reputation: 66775
You need to pass your prediction matrix through the same fitted feature selector too. Somewhere in your code you presumably do
df = sel.fit_transform(X)
and before predicting
df_prediction = sel.transform(X_prediction)
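To make the order of operations concrete, here is a minimal end-to-end sketch. The synthetic data, the `make_data` helper, and the smaller `n_estimators` value are assumptions for illustration only; they are not from the question. The key point is that `sel` is fit once on the training matrix, and the same fitted object is reused on the prediction matrix. The second half shows one possible alternative: wrapping both steps in a scikit-learn `Pipeline`, so the selection is applied automatically at predict time.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(1)


def make_data(n_rows):
    """Hypothetical data: 10 high-variance columns plus 10 constant columns."""
    varying = rng.normal(scale=2.0, size=(n_rows, 10))
    constant = np.zeros((n_rows, 10))
    return np.hstack([varying, constant])


X = make_data(500)                       # stands in for the training matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # made-up binary target
X_prediction = make_data(50)             # stands in for the prediction matrix

# Fit the selector on the training data only, then reuse it unchanged.
sel = VarianceThreshold(threshold=0.9)
df = sel.fit_transform(X)
df_prediction = sel.transform(X_prediction)

boosting_model = GradientBoostingClassifier(n_estimators=20, max_depth=3,
                                            random_state=1)
boosting_model.fit(df, y)
predict_target = boosting_model.predict(df_prediction)

# Alternative: a pipeline keeps selection and model together, so the raw
# prediction matrix can be passed in directly.
pipeline = make_pipeline(
    VarianceThreshold(threshold=0.9),
    GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=1),
)
pipeline.fit(X, y)
pipeline_target = pipeline.predict(X_prediction)
```

Calling `sel.fit_transform` on the prediction data instead of `sel.transform` could select a different set of columns, so the selector should only ever be fit on the training data.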
Upvotes: 4