Reputation: 1156
I am using Python 3.5 and the Python implementation of XGBoost, version 0.6.
I built a forward feature selection routine in Python, which iteratively builds the optimal set of features (the ones leading to the best score; the metric here is binary classification error).
On my data set, using xgb.cv routine, I can get down to an error rate of around 0.21 by increasing max_depth (of trees) up to 40...
But if I then run a custom cross-validation with the same XGBoost parameters, same folds, same metric and same data set, the best score I reach is 0.70 with a max_depth of 4 ... and if I use the optimal max_depth obtained by my xgb.cv routine, the score drops to 0.65 ... I just don't understand what is happening ...
My best guess is that xgb.cv uses different folds (i.e. it shuffles the data before partitioning), but I think I already pass the folds as an input to xgb.cv (with the option shuffle=False) ... so it might be something completely different ...
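To rule the folds out, here is a minimal sketch of the check I have in mind, on dummy data rather than my real data set (passing a plain list of (train_idx, test_idx) tuples through folds is an assumption about this xgb.cv version; older builds may only accept the KFold object itself):

# Minimal sketch on dummy data: freeze the fold indices once and reuse the
# identical list in xgb.cv and in a manual loop, so any remaining score gap
# cannot come from the partitioning.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold

X = pd.DataFrame(np.random.rand(130, 5), columns=list("abcde"))
y = pd.Series(np.random.randint(0, 2, 130))
params = {"objective": "binary:logistic", "max_depth": 4,
          "eval_metric": "error", "eta": 0.15}

folds = list(KFold(n_splits=13, shuffle=False).split(X))   # frozen, deterministic splits

# 1) xgb.cv on the frozen folds (assumes folds accepts a list of index tuples)
cv_res = xgb.cv(params, xgb.DMatrix(X, label=y), num_boost_round=30, folds=folds)
print("xgb.cv   :", cv_res["test-error-mean"].iloc[-1])

# 2) manual loop over the very same indices
errors = []
for train_idx, test_idx in folds:
    booster = xgb.train(params, xgb.DMatrix(X.iloc[train_idx], label=y.iloc[train_idx]), 30)
    pred = booster.predict(xgb.DMatrix(X.iloc[test_idx])) > 0.5
    errors.append(np.mean(pred != y.iloc[test_idx].values))
print("manual CV:", np.mean(errors))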
Here is the code of the forward_feature_selection (using xgb.cv):
# imports used by the snippets below
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold

def Forward_Feature_Selection(train, y_train, params, num_round=30, threshold=0,
                              initial_score=0.5, to_exclude=[], nfold=5):
    k_fold = KFold(n_splits=13)
    selected_features = []
    gain = threshold + 1
    previous_best_score = initial_score
    train = train.drop(train.columns[to_exclude], axis=1)  # df.columns is a zero-based pd.Index
    features = train.columns.values
    selected = np.zeros(len(features))
    scores = np.zeros(len(features))
    while gain > threshold:  # start an add-a-feature loop
        for i in range(0, len(features)):
            if selected[i] == 0:  # take only features not yet selected
                selected_features.append(features[i])
                new_train = train.iloc[:][selected_features]
                selected_features.remove(features[i])
                dtrain = xgb.DMatrix(new_train, y_train, missing=None)
                # dtrain = xgb.DMatrix(pd.DataFrame(new_train), y_train, missing=None)
                if i % 10 == 0:
                    print("Launching XGBoost for feature " + str(i))
                xgb_cv = xgb.cv(params, dtrain, num_round, nfold=13, folds=k_fold, shuffle=False)
                if params['objective'] == 'binary:logistic':
                    scores[i] = xgb_cv.tail(1)["test-error-mean"]  # classification
                else:
                    scores[i] = xgb_cv.tail(1)["test-rmse-mean"]   # regression
            else:
                scores[i] = initial_score  # discard already selected variables from candidates
        best = np.argmin(scores)
        gain = previous_best_score - scores[best]
        if gain > 0:
            previous_best_score = scores[best]
            selected_features.append(features[best])
            selected[best] = 1
            print("Adding feature: " + features[best] + " increases score by " + str(gain) + ". Final score is now: " + str(previous_best_score))
    return (selected_features, previous_best_score)
And here is my "custom" cross-validation:
mean_error_rate = 0
for train, test in k_fold.split(ds):
    dtrain = xgb.DMatrix(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"], missing=None)
    gbm = xgb.train(params, dtrain, 30)
    dtest = xgb.DMatrix(pd.DataFrame(ds.iloc[test]), dc.iloc[test]["bin_spread"], missing=None)
    res.ix[test, "pred"] = gbm.predict(dtest)
    cv_reg = reg.fit(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"])
    res.ix[test, "lasso"] = cv_reg.predict(pd.DataFrame(ds.iloc[test]))
    res.ix[test, "y_xgb"] = res.loc[test, "pred"] > 0.5
    res.ix[test, "xgb_right"] = (res.loc[test, "y_xgb"] == res.loc[test, "bin_spread"])
    print(str(100 * np.sum(res.loc[test, "xgb_right"]) / (N / 13)))
    mean_error_rate += 100 * (np.sum(res.loc[test, "xgb_right"]) / (N / 13))
print("mean_error_rate is : " + str(mean_error_rate / 13))
using the following parameters:
params = {"objective": "binary:logistic",
          "booster": "gbtree",
          "max_depth": 4,
          "eval_metric": "error",
          "eta": 0.15}
res = pd.DataFrame(dc["bin_spread"])
k_fold = KFold(n_splits=13)
N = dc.shape[0]
num_trees = 30
And finally the call to my forward feature selection:
selfeat = Forward_Feature_Selection(dc,
                                    dc["bin_spread"],
                                    params,
                                    num_round=num_trees,
                                    threshold=0,
                                    initial_score=999,
                                    to_exclude=[0, 1, 5, 30, 31],
                                    nfold=13)
Any help understanding what is happening would be greatly appreciated! Thanks in advance for any tips!
Upvotes: 2
Views: 1113
Reputation: 753
This is normal; I have experienced the same. Firstly, KFold may split differently on each run. You have specified the folds for XGBoost, but KFold is not guaranteed to split consistently, which is normal. Next, the initial state of the model is different each time. There are inner random states within XGBoost which can cause this too; try changing the eval metric to see if the variance reduces. If a particular metric suits your needs, try to average the best parameters and use those as your optimal parameters.
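For instance, something along these lines (a rough sketch with dummy data standing in for your DMatrix, using the usual seed / random_state knobs) pins the randomness you do control:

# Rough sketch: fix the randomness that is under your control and re-run.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold

X = pd.DataFrame(np.random.rand(130, 5))
y = np.random.randint(0, 2, 130)
dtrain = xgb.DMatrix(X, label=y)                            # stand-in for your real DMatrix

k_fold = KFold(n_splits=13, shuffle=True, random_state=42)  # same shuffle on every run

params = {"objective": "binary:logistic",
          "booster": "gbtree",
          "max_depth": 4,
          "eval_metric": "error",
          "eta": 0.15,
          "seed": 42}                                        # XGBoost's own RNG seed

xgb_cv = xgb.cv(params, dtrain, num_boost_round=30, folds=k_fold, seed=42)
print(xgb_cv.tail(1)["test-error-mean"])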
Upvotes: 1