citraL

Reputation: 1156

Can't reproduce Xgb.cv cross-validation results

I am using Python 3.5 and the Python implementation of XGBoost, version 0.6.

I built a forward feature selection routine in Python, which iteratively builds the optimal set of features (i.e. the set leading to the best score; here the metric is binary classification error).

On my data set, using the xgb.cv routine, I can get down to an error rate of around 0.21 by increasing max_depth (of trees) up to 40...

But if I then run a custom cross-validation, using the same XGBoost parameters, the same folds, the same metric and the same data set, the best score I reach is 0.70, with a max_depth of 4. If I use the optimal max_depth obtained by my xgb.cv routine, the score drops to 0.65. I just don't understand what is happening.

My best guess is that xgb.cv is using different folds (i.e. it shuffles the data before partitioning), but I think I already submit the folds as an input to xgb.cv (with the option shuffle=False), so it might be something completely different.
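
To double-check that guess, here is a small sanity check (separate from my actual routine; X here is just a placeholder DataFrame): KFold without shuffling is deterministic, so two calls on the same data must yield exactly the same partition.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

X = pd.DataFrame(np.random.rand(130, 5))      # placeholder data, 13 folds of 10 rows

k_fold = KFold(n_splits=13, shuffle=False)    # no shuffling: deterministic splits
folds = list(k_fold.split(X))                 # list of (train_idx, test_idx) pairs
folds_again = list(k_fold.split(X))

# Both calls must produce identical index arrays.
assert all(np.array_equal(tr1, tr2) and np.array_equal(te1, te2)
           for (tr1, te1), (tr2, te2) in zip(folds, folds_again))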

Here is the code of Forward_Feature_Selection (which uses xgb.cv):

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold

def Forward_Feature_Selection(train, y_train, params, num_round=30, threshold=0, initial_score=0.5, to_exclude=[], nfold=5):

    k_fold = KFold(n_splits=nfold)    # no shuffling, so the splits are deterministic
    selected_features = []
    gain = threshold + 1
    previous_best_score = initial_score
    train = train.drop(train.columns[to_exclude], axis=1)  # df.columns is zero-based pd.Index 
    features = train.columns.values
    selected = np.zeros(len(features))
    scores = np.zeros(len(features))
    while (gain > threshold):    # we start a add-a-feature loop
        for i in range(0,len(features)):
            if (selected[i]==0):   # take only features not yet selected
                selected_features.append(features[i])
                new_train = train.iloc[:][selected_features]
                selected_features.remove(features[i])
                dtrain = xgb.DMatrix(new_train, y_train, missing = None)
            #    dtrain = xgb.DMatrix(pd.DataFrame(new_train), y_train, missing = None)
                if (i % 10 == 0):
                    print("Launching XGBoost for feature "+ str(i))
                xgb_cv = xgb.cv(params, dtrain, num_round, nfold=nfold, folds=k_fold, shuffle=False)
                if params['objective'] == 'binary:logistic':
                    scores[i] = xgb_cv.tail(1)["test-error-mean"]   #classification
                else:
                    scores[i] = xgb_cv.tail(1)["test-rmse-mean"]    #regression
            else:
                scores[i] = initial_score    # discard already selected variables from candidates
        best = np.argmin(scores)
        gain = previous_best_score - scores[best]
        if (gain > 0):        
            previous_best_score = scores[best]  
            selected_features.append(features[best])
            selected[best] = 1

        print("Adding feature: " + features[best] + " increases score by " + str(gain) + ". Final score is now: " + str(previous_best_score)) 
    return (selected_features, previous_best_score)

and here is my "custom" cross validation:

# ds: feature DataFrame, dc: DataFrame holding the "bin_spread" target,
# reg: a lasso regressor used for comparison (all defined elsewhere)
mean_error_rate = 0
for train, test in k_fold.split(ds):
    dtrain =  xgb.DMatrix(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"], missing = None)
    gbm = xgb.train(params, dtrain, 30)
    dtest =  xgb.DMatrix(pd.DataFrame(ds.iloc[test]), dc.iloc[test]["bin_spread"], missing = None)
    res.ix[test,"pred"] = gbm.predict(dtest)

    cv_reg = reg.fit(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"])
    res.ix[test,"lasso"] = cv_reg.predict(pd.DataFrame(ds.iloc[test]))

    res.ix[test,"y_xgb"] = res.loc[test,"pred"] > 0.5            # class prediction at 0.5 threshold
    res.ix[test, "xgb_right"] = (res.loc[test,"y_xgb"]==res.loc[test,"bin_spread"])   # True where XGBoost is correct
    print (str(100*np.sum(res.loc[test, "xgb_right"])/(N/13)))
    mean_error_rate += 100*(np.sum(res.loc[test, "xgb_right"])/(N/13))   # percentage of correct predictions in this fold
print("mean_error_rate is : " + str(mean_error_rate/13))   # averaged over the 13 folds

using the following parameters:

params = {"objective": "binary:logistic", 
          "booster":"gbtree",
          "max_depth":4, 
          "eval_metric" : "error",
          "eta" : 0.15}
res = pd.DataFrame(dc["bin_spread"]) 
k_fold = KFold(n_splits=13)
N = dc.shape[0]
num_trees = 30
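
For reference, with "objective": "binary:logistic" the "error" metric reported by xgb.cv is the fraction of wrong predictions at a 0.5 threshold; a minimal stand-alone equivalent (just for illustration, not something I actually use) would be:

import numpy as np

def binary_error(preds, labels):
    # fraction of predictions that land on the wrong side of 0.5
    return np.mean((np.asarray(preds) > 0.5) != np.asarray(labels))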

And finally the call to my forward feature selection:

selfeat = Forward_Feature_Selection(dc, 
                                    dc["bin_spread"], 
                                    params, 
                                    num_round = num_trees,
                                    threshold = 0,
                                    initial_score=999,
                                    to_exclude = [0,1,5,30,31],
                                    nfold = 13)

Any help in understanding what is happening would be greatly appreciated! Thanks in advance for any tips!

Upvotes: 2

Views: 1113

Answers (1)

Abhishek Vijayan

Reputation: 753

This is normal; I have experienced the same. Firstly, KFold can split differently on each run. You have specified the folds in XGBoost, but KFold is not guaranteed to split consistently, which is normal. Secondly, the initial state of the model is different each time: there are internal random states within XGBoost which can cause this too. Try changing the eval metric to see if the variance reduces. If a particular metric suits your needs, try averaging the best parameters and using those as your optimal parameters.
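
If you need the runs themselves to be repeatable, a minimal sketch of pinning both sources of randomness (the fold assignment and XGBoost's own seed) could look like this; the parameter names are the standard xgboost/sklearn ones and the data is synthetic, just for illustration:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)                      # synthetic data, for illustration only
X = pd.DataFrame(rng.rand(260, 10))
y = pd.Series((rng.rand(260) > 0.5).astype(int))

params = {"objective": "binary:logistic",
          "eval_metric": "error",
          "max_depth": 4,
          "eta": 0.15,
          "seed": 0}                                # fixes XGBoost's internal RNG

k_fold = KFold(n_splits=13, shuffle=False)          # deterministic partition
dtrain = xgb.DMatrix(X, y)

res1 = xgb.cv(params, dtrain, num_boost_round=30, folds=k_fold, seed=0)
res2 = xgb.cv(params, dtrain, num_boost_round=30, folds=k_fold, seed=0)

# With the folds and the seeds fixed, repeated runs should give identical curves.
print(res1["test-error-mean"].equals(res2["test-error-mean"]))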

Upvotes: 1
