Tom

Use StratifiedGroupKFold with GridSearchCV in an XGBoost Model

Dataset: I have a very imbalanced binary dataset of Groups: approximately 362 "yes" Groups and 47,000 "no" Groups. Each Group has time series data recorded at minute intervals. For example, a "yes" event may span 10 time steps, so there would be 10 rows labeled with the Yes class for Group X. The dataset contains 110 features. Below is an example snippet.

Group  Time  Class  Feature 1  Feature 2  Feature 3
    1     1      1       2000        1.7         30
    1     2      1       2080        1.9         32
    1     3      1       2070        2.1         39
    2     1      0       1400        0.8         29
    2     2      0       1440        0.5         26
    2     3      0       1380        0.6         24
    3     1      0        680        0.3         27
    3     2      0        800        0.2         26
    3     3      0        880        0.5         21
    3     4      0        780        0.6         22

Strategy: I'm attempting to use XGBoost because of its reputation for handling complicated imbalanced datasets. I want to predict the rare occurrence of the yes/1 class amongst a sea of no/0 classes. I'm using GroupShuffleSplit to split the data on the Group variable into train/test datasets with a 75/25 split.

After this, I want to use GridSearchCV to find the best hyperparameters. To my knowledge, best practice is to split my training dataset into training/validation folds for this, and I need to make sure those folds are also split on Groups, like I did earlier. This is where I'm confused about how to do it.
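
From reading the scikit-learn docs, I believe the intended pattern is to pass the splitter object itself as cv and hand the group labels to fit() via the groups argument, which GridSearchCV forwards to the splitter. Here is a minimal sketch on synthetic data (the shapes, labels, and parameter grid are made up, not my real dataset):

# Minimal sketch on synthetic data: GridSearchCV forwards `groups`
# from fit() to the StratifiedGroupKFold splitter passed as cv.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))         # 200 rows, 5 features
groups = np.repeat(np.arange(40), 5)  # 40 groups of 5 rows each
y = (groups % 10 == 0).astype(int)    # 4 rare "yes" groups

cv = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=42)
gs = GridSearchCV(
    XGBClassifier(n_estimators=50, objective='binary:logistic'),
    param_grid={'max_depth': [3, 5]},
    cv=cv,
    scoring='f1_macro')
gs.fit(X, y, groups=groups)  # groups go to fit(), not to the splitter
print(gs.best_params_)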

Below is the code I have for all of this. I include the remainder of my code to show how I'm taking the best hyperparameters and evaluating the model on the testing dataset.

I am not sure if I am using StratifiedGroupKFold with GridSearchCV correctly. I pass "train_data" into StratifiedGroupKFold because it contains the "Group" variable, and then fit the grid search on X_train/Y_train, where "Group" has been dropped so it doesn't get treated as a feature.

Any help/tips would be very much appreciated!

# Imports
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GroupShuffleSplit, StratifiedGroupKFold, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error
from xgboost import XGBClassifier

# Encode Group data
le = LabelEncoder()
mydata['Group'] = le.fit_transform(mydata['Group'])

# Split dataset into train/test. Split on 'Group'
splitter = GroupShuffleSplit(train_size=0.75, n_splits=2, random_state=42)
split = splitter.split(mydata, groups=mydata['Group'])
train_inds, test_inds = next(split)
train_data = mydata.iloc[train_inds]
test_data = mydata.iloc[test_inds]

# list of Group IDs from the training data (used for the CV splits below)
groupdata = train_data['Group'].copy().tolist()

# X and Y training data
Y_train = train_data['Class'].copy().tolist()
X_train = train_data.drop(['Class','Group'], axis=1)

# X and Y testing data
Y_test = test_data['Class'].copy().tolist()
X_test = test_data.drop(['Class','Group'], axis=1)


# Initialize model
xgb_model = XGBClassifier(
    n_jobs=10,
    eval_metric='error',
    scale_pos_weight=103,
    n_estimators=400,
    learning_rate=0.025,
    objective='binary:logistic'
)

# hyperparameters to search over
param_grids = {
    'max_depth': [3, 4, 5, 6, 7, 8],
    'subsample': [0.3, 0.4, 0.5, 0.6],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'colsample_bylevel': [0.8, 0.9, 1.0],
    'min_child_weight': [0.8, 0.9, 1.0]
}

# StratifiedGroupKFold setup: stratify on the Class labels, split on Group
sgkfold = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=42)
cvstrat = sgkfold.split(X=X_train, y=Y_train, groups=groupdata)

# Grid search with the precomputed group-aware CV splits
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grids,
    n_jobs=10,
    cv=cvstrat,
    scoring='f1_macro')

#fit the grid search to the training data
grid_search.fit(X_train, Y_train)

#get best parameters
best_params = grid_search.best_params_

# Create the best model with the tuned hyperparameters
best_xgb_model = XGBClassifier(
    **best_params,
    n_jobs=10,
    eval_metric='error',
    scale_pos_weight=103,
    n_estimators=400,
    learning_rate=0.025,
    objective='binary:logistic')

#fit best model to training data
best_xgb_model.fit(X_train, Y_train)

# Make predictions on the test set
Y_pred = best_xgb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(Y_test, Y_pred)
conf_matrix = confusion_matrix(Y_test, Y_pred)
classification_rep = classification_report(Y_test, Y_pred)
mse_xgb = mean_squared_error(Y_test, Y_pred)
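
As a sanity check on the fold construction (a sketch using the objects defined above), I can verify that no Group ID lands on both sides of any fold:

# Sketch: confirm the splitter never puts a Group in both the train and
# validation indices of the same fold
for fold, (tr_idx, va_idx) in enumerate(
        sgkfold.split(X=X_train, y=Y_train, groups=groupdata)):
    tr_groups = set(train_data['Group'].iloc[tr_idx])
    va_groups = set(train_data['Group'].iloc[va_idx])
    assert tr_groups.isdisjoint(va_groups), f"Group leakage in fold {fold}"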

I am getting poor results. Below are the confusion matrix and classification report.

[[237944  39933]
 [   884   2184]]

Classification Report:

              precision    recall  f1-score   support

           0       1.00      0.86      0.92    277877
           1       0.05      0.71      0.10      3068

I am successfully predicting the 'Yes' class 71% of the time, but with a false alarm ratio of 95% (precision 0.05). These events are very difficult to predict, but I imagined the results would be better. I wonder if trimming down my number of features would help, but first I want to make sure that my modeling strategy itself isn't in error.
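
One diagnostic I'm considering (a sketch; the threshold grid is arbitrary): since scale_pos_weight=103 inflates the predicted probabilities for the rare class, sweeping the decision threshold on predict_proba instead of using the 0.5 default that predict() applies might trade some recall for precision:

# Sketch: sweep the decision threshold instead of relying on predict()'s
# default 0.5 cutoff
import numpy as np
from sklearn.metrics import precision_score, recall_score

proba = best_xgb_model.predict_proba(X_test)[:, 1]
for thresh in np.arange(0.5, 1.0, 0.05):
    pred = (proba >= thresh).astype(int)
    print(f"threshold={thresh:.2f}  "
          f"precision={precision_score(Y_test, pred, zero_division=0):.3f}  "
          f"recall={recall_score(Y_test, pred):.3f}")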
