Dataset: I have a very imbalanced binary dataset of Groups: approximately 362 "yes" Groups and 47,000 "no" Groups. Each Group has time series data recorded at minute intervals. For example, a "yes" event may span 10 timestamps, so there would be 10 rows labeled with the Yes class for Group X. The dataset contains 110 features. Below is an example snippet.
Group | Time | Class | Feature 1 | Feature 2 | Feature 3 |
---|---|---|---|---|---|
1 | 1 | 1 | 2000 | 1.7 | 30 |
1 | 2 | 1 | 2080 | 1.9 | 32 |
1 | 3 | 1 | 2070 | 2.1 | 39 |
2 | 1 | 0 | 1400 | 0.8 | 29 |
2 | 2 | 0 | 1440 | 0.5 | 26 |
2 | 3 | 0 | 1380 | 0.6 | 24 |
3 | 1 | 0 | 680 | 0.3 | 27 |
3 | 2 | 0 | 800 | 0.2 | 26 |
3 | 3 | 0 | 880 | 0.5 | 21 |
3 | 4 | 0 | 780 | 0.6 | 22 |
Strategy: I'm attempting to use XGBoost because of its reputation for handling complicated, imbalanced datasets. I want to predict the rare occurrence of the yes/1 class amongst a sea of no/0 rows. I'm using GroupShuffleSplit to split the data on the Group variable into 75/25% train/test sets.
After this, I want to use GridSearchCV to find the best hyperparameters. To my knowledge, best practice is to split the training dataset into training/validation sets for this, and I need to make sure that the training/validation split is also done on Groups, like the earlier split. This is where I'm confused about how to do it.
Below is some code that I have to do all of this. I show the remainder of my code to demonstrate how I'm taking the best hyperparameters and evaluating the model on the testing dataset.
I am not sure if I am using StratifiedGroupKFold with GridSearchCV correctly. I pass train_data into StratifiedGroupKFold because it contains the "Group" variable, and then fit the grid search on X_train/Y_train, from which "Group" has been removed so it doesn't get treated as a feature.
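For reference, my understanding from the scikit-learn docs (so treat it as my assumption) is that the splitter object itself can be passed as cv, with the group labels supplied through fit. A minimal sketch using the variable names from my code below:

from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold
# Pass the splitter object as cv and hand the group labels to fit();
# GridSearchCV then builds group-disjoint, class-stratified folds itself.
cv = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(estimator=xgb_model, param_grid=param_grids,
                      scoring='f1_macro', cv=cv, n_jobs=10)
search.fit(X_train, Y_train, groups=groupdata)

Unlike a pre-materialized generator of splits, this form can be fit more than once, since a generator passed as cv is exhausted after a single use.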
Any help/tips would be very much appreciated!
# Imports
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GroupShuffleSplit, StratifiedGroupKFold, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error
# Encode Group data
le = LabelEncoder()
mydata['Group'] = le.fit_transform(mydata['Group'])
# Split dataset into train/test (75/25), keeping each Group entirely on one side
splitter = GroupShuffleSplit(train_size=0.75, n_splits=2, random_state=42)
split = splitter.split(mydata, groups=mydata['Group'])
train_inds, test_inds = next(split)  # only the first split is used
train_data = mydata.iloc[train_inds]
test_data = mydata.iloc[test_inds]
# list of Group IDs from the training data
groupdata = train_data['Group'].copy().tolist()
# X and Y training data
Y_train = train_data['Class'].copy().tolist()
X_train = train_data.drop(['Class','Group'], axis=1)
# X and Y testing data
Y_test = test_data['Class'].copy().tolist()
X_test = test_data.drop(['Class','Group'], axis=1)
# Initialize the model
xgb_model = XGBClassifier(
    n_jobs=10,
    eval_metric='error',
    scale_pos_weight=103,  # up-weight the rare positive class
    n_estimators=400,
    learning_rate=0.025,
    objective='binary:logistic'
)
# Hyperparameters to search over
param_grids = {
    'max_depth': [3, 4, 5, 6, 7, 8],
    'subsample': [0.3, 0.4, 0.5, 0.6],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'colsample_bylevel': [0.8, 0.9, 1.0],
    'min_child_weight': [0.8, 0.9, 1.0]
}
# StratifiedGroupKFold setup: stratify on the class label, keep Groups intact
sgkfold = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=42)
cvstrat = sgkfold.split(X=train_data, y=train_data['Class'], groups=train_data['Group'])
# Grid search over the StratifiedGroupKFold splits
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grids,
    n_jobs=10,
    cv=cvstrat,
    scoring='f1_macro'
)
#fit the grid search to the training data
grid_search.fit(X_train, Y_train)
#get best parameters
best_params = grid_search.best_params_
# Create the best model (grid_search.best_estimator_ would also hold a refit copy)
best_xgb_model = XGBClassifier(
    **best_params,
    n_jobs=10,
    eval_metric='error',
    scale_pos_weight=103,
    n_estimators=400,
    learning_rate=0.025,
    objective='binary:logistic'
)
#fit best model to training data
best_xgb_model.fit(X_train, Y_train)
# Make predictions on the test set
Y_pred = best_xgb_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(Y_test, Y_pred)
conf_matrix = confusion_matrix(Y_test, Y_pred)
classification_rep = classification_report(Y_test, Y_pred)
mse_xgb = mean_squared_error(Y_test, Y_pred)  # on 0/1 labels this equals the misclassification rate
I am getting poor results. Below is a confusion matrix and classification report.
[[237944 39933]
[ 884 2184]]
Classification Report:
precision recall f1-score support
0 1.00 0.86 0.92 277877
1 0.05 0.71 0.10 3068
I am successfully predicting the 'Yes' class 71% of the time (recall), but with a false alarm ratio of 95% (precision of 5%). These events are very difficult to predict, but I expected better results. I wonder whether trimming down my number of features would help, but first I want to make sure I'm not making an error in my modeling strategy.
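One idea I have been considering, sketched below and not something I have validated: since scale_pos_weight pushes the predicted probabilities toward the positive class, moving the decision threshold away from the default 0.5 might trade some recall for a much lower false alarm ratio.

from sklearn.metrics import precision_recall_curve

# Scores rather than hard 0/1 predictions; column 1 is P(class == 1)
probs = best_xgb_model.predict_proba(X_test)[:, 1]
# Precision/recall at every candidate threshold
# (NB: in practice the threshold should be chosen on a grouped
# validation split, not on the test set as in this sketch)
precision, recall, thresholds = precision_recall_curve(Y_test, probs)
# Example rule: pick the threshold maximizing F1
# (the last precision/recall point has no corresponding threshold)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = f1[:-1].argmax()
Y_pred_tuned = (probs >= thresholds[best]).astype(int)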