XGBoost showing same prediction for all test data

Question

I am working on a problem where I need to predict an output label from a set of input values. Since I do not have the real data yet, I am generating dummy data so that my code is ready by the time the data arrives. A sample of the data is shown below: 32 input columns, followed by a last column, 'output', which holds the label to predict.

input_1,input_2,input_3,input_4,input_5,input_6,input_7,input_8,input_9,input_10,input_11,input_12,input_13,input_14,input_15,input_16,input_17,input_18,input_19,input_20,input_21,input_22,input_23,input_24,input_25,input_26,input_27,input_28,input_29,input_30,input_31,input_32,output
0.0,97.0,155,143,98,145,102,102,144,100,96,193,90,98,98,122,101,101,101,98,99,96,118,148,98,99,112,94,98,100,96.0,95,loc12
96.0,94.0,116,99,98,105,95,101,168,101,96,108,95,98,98,96,102,98,98,99,98,98,132,150,102,101,195,104,96,97,93.0,98,loc27

Since this is dummy data, I set the output label to the position of the input with the maximum value. For example, in the first row the maximum value (193) is at the 12th position, so the output is loc12. My expectation is that XGBoost should learn this rule on its own and predict the output label correctly.
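The labelling rule described above can be sketched as follows. This is only an illustration under assumptions: the helper names `label_row` and `make_dummy_data`, the value range, and the seed are hypothetical, since the actual generator script is not shown in the post.

```python
import numpy as np

def label_row(row):
    """Return 'locK', where K is the 1-based position of the row maximum."""
    # np.argmax is 0-based and returns the first maximum if there are ties.
    return "loc{}".format(int(np.argmax(row)) + 1)

def make_dummy_data(n_rows=1000, n_inputs=32, seed=0):
    """Generate random integer inputs and label each row by its maximum."""
    rng = np.random.default_rng(seed)
    # Hypothetical value range; the real data may be distributed differently.
    X = rng.integers(90, 200, size=(n_rows, n_inputs))
    y = np.array([label_row(row) for row in X])
    return X, y

# First sample row from the post.
first_row = [0.0, 97.0, 155, 143, 98, 145, 102, 102, 144, 100, 96, 193,
             90, 98, 98, 122, 101, 101, 101, 98, 99, 96, 118, 148, 98,
             99, 112, 94, 98, 100, 96.0, 95]
print(label_row(first_row))  # loc12

X, y = make_dummy_data(n_rows=5)
print(X.shape, y)
```

Note that a generator like this can produce up to 32 distinct labels, whereas the predictions shown further down have only nine columns, so the real dummy data was presumably generated somewhat differently.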

I have written the code below to train and test XGBoost.

from __future__ import division
import numpy as np
import pandas as pd
import scipy.sparse
import pickle
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, LabelBinarizer

df=pd.read_csv("data.txt", sep=',')

# Create training and validation sets
sz = df.shape
train = df.iloc[:int(sz[0] * 0.7), :]
test = df.iloc[int(sz[0] * 0.7):, :]

# Separate X & Y for training
train_X = train.iloc[:, :32].values
train_Y = train.iloc[:, 32].values

# Separate X & Y for test
test_X = test.iloc[:, :32].values
test_Y = test.iloc[:, 32].values

# Get the count of unique output labels
num_classes = df.output.nunique()

lb = LabelBinarizer()
train_Y = lb.fit_transform(train_Y.tolist())
# Reuse the encoder fitted on the training labels so both sets share one class order
test_Y = lb.transform(test_Y.tolist())

# Normalize the training data
#train_X -= np.mean(train_X, axis=0)
#train_X /= np.std(train_X, axis=0)
#train_X /= 255

# Normalize the test data
#test_X -= np.mean(test_X, axis=0)
#test_X /= np.std(test_X, axis=0)
#test_X /= 255

xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)

# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
# learning rate (step-size shrinkage applied after each boosting round)
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = num_classes

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 5
bst = xgb.train(param, xg_train, num_round, watchlist)
#bst.dump_model('dump.raw.txt')
# get prediction
pred = bst.predict(xg_test)
actual = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred != actual) / test_Y.shape[0]
print('Test error using softmax = {}'.format(error_rate))

# do the same thing again, but output probabilities
param['objective'] = 'multi:softprob'
bst = xgb.train(param, xg_train, num_round, watchlist)
# Note: this convention has been changed since xgboost-unity
# get prediction, this is in 1D array, need reshape to (ndata, nclass)
pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], num_classes)
pred_label = np.argmax(pred_prob, axis=1)
actual_label = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred_label != actual_label) / test_Y.shape[0]
print('Test error using softprob = {}'.format(error_rate))

However, the model always predicts label 0, i.e. the first index in the one-hot encoded output.

Output:

[0] train-merror:0.11081    test-merror:0.111076
[1] train-merror:0.11081    test-merror:0.111076
[2] train-merror:0.11081    test-merror:0.111076
[3] train-merror:0.111216   test-merror:0.111076
[4] train-merror:0.11081    test-merror:0.111076
Test error using softmax = 0.64846954875355
[0] train-merror:0.11081    test-merror:0.111076
[1] train-merror:0.11081    test-merror:0.111076
[2] train-merror:0.11081    test-merror:0.111076
[3] train-merror:0.111216   test-merror:0.111076
[4] train-merror:0.11081    test-merror:0.111076
Test error using softprob = 0.64846954875355

Prediction:

pred_prob[0:10]
array([[0.34024397, 0.10218474, 0.07965304, 0.07965304, 0.07965304,
        0.07965304, 0.07965304, 0.07965304, 0.07965304],
       [0.34009758, 0.10257103, 0.07961877, 0.07961877, 0.07961877,
        0.07961877, 0.07961877, 0.07961877, 0.07961877],
       [0.34421352, 0.09171014, 0.08058234, 0.08058234, 0.08058234,
        0.08058234, 0.08058234, 0.08058234, 0.08058234],
       [0.33950377, 0.10413795, 0.07947975, 0.07947975, 0.07947975,
        0.07947975, 0.07947975, 0.07947975, 0.07947975],
       [0.3426607 , 0.09580766, 0.08021881, 0.08021881, 0.08021881,
        0.08021881, 0.08021881, 0.08021881, 0.08021881],
       [0.33777002, 0.10427278, 0.07970817, 0.07970817, 0.07970817,
        0.07970817, 0.07970817, 0.07970817, 0.07970817],
       [0.33733884, 0.10985068, 0.07897293, 0.07897293, 0.07897293,
        0.07897293, 0.07897293, 0.07897293, 0.07897293],
       [0.33953893, 0.10404517, 0.07948799, 0.07948799, 0.07948799,
        0.07948799, 0.07948799, 0.07948799, 0.07948799],
       [0.33987975, 0.10314585, 0.07956778, 0.07956778, 0.07956778,
        0.07956778, 0.07956778, 0.07956778, 0.07956778],
       [0.34013695, 0.10246711, 0.07962799, 0.07962799, 0.07962799,
        0.07962799, 0.07962799, 0.07962799, 0.07962799]], dtype=float32)

Whatever accuracy I am getting comes only from predicting label 0, which makes up around 35% of the data.

Is my expectation correct here? Are there too many input features and too little data for it to learn properly?
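One thing that may be worth checking before looking at the feature count is the shape of the labels. The `multi:softmax` and `multi:softprob` objectives expect one integer class index per row, in the range [0, num_class), while `LabelBinarizer` returns a one-hot matrix that contains only 0s and 1s. The sketch below contrasts the two encodings using NumPy only. The sample labels are made up for illustration, and `np.unique(..., return_inverse=True)` stands in for scikit-learn's `LabelEncoder`, which likewise maps the sorted class names to 0, 1, 2, and so on.

```python
import numpy as np

# Made-up labels for illustration; not taken from the original data file.
labels = np.array(["loc12", "loc27", "loc3", "loc12", "loc9", "loc27"])

# Integer class indices: a single value per row, in [0, num_classes).
classes, codes = np.unique(labels, return_inverse=True)
num_classes = len(classes)

# One-hot matrix: num_classes values per row, each either 0 or 1.
one_hot = np.eye(num_classes, dtype=int)[codes]

print(codes.shape)    # one label per sample
print(one_hot.shape)  # num_classes values per sample

# In a one-hot matrix the share of 1s is always 1 / num_classes.
print(one_hot.mean(), 1 / num_classes)
```

With nine classes, 1/9 is about 0.111, which is close to the merror of 0.11081 in the training log. That would be consistent with the booster seeing only 0/1 values as labels, although this is an inference from the numbers rather than something confirmed in the post. If that is the case, passing integer codes as `label=` to `xgb.DMatrix` and comparing the predictions against those codes directly, without `np.argmax`, would be the first thing to try.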

Full code: Here

Test Data: Here
