Reputation: 1
Though the code below "works" (in that it does not raise an error), I get very high AUCs, which makes me wonder whether it is somehow skipping the type of cross-validation I am actually trying to conduct.
Each group is the collection of data coming from a single participant. So, at every fold, all of one participant's data is held out for testing, a model is fitted on the data of all remaining participants, and then evaluated on the left-out participant's data. I shuffle within each group because the order of tasks is the same for all participants, and I also normalize the features. I am using the scikit-learn library.
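To make the fold structure I am aiming for concrete, here is a tiny toy example of LeaveOneGroupOut (hypothetical data, not my actual dataset), where each fold holds out exactly one participant:
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: 6 samples from 3 participants (2 samples each)
X_toy = np.arange(12).reshape(6, 2)
groups_toy = np.array(['p1', 'p1', 'p2', 'p2', 'p3', 'p3'])

logo_toy = LeaveOneGroupOut()
for train_idx, test_idx in logo_toy.split(X_toy, groups=groups_toy):
    # Each fold's test set contains all samples of exactly one participant
    print('held-out participant:', np.unique(groups_toy[test_idx]), '| test rows:', test_idx)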
Is anything incorrect here (or anything that increases overfitting)? Is this the right way to implement leave-one-group-out (LOGO) cross-validation? My feature matrix does not include the target or the task number.
Secondary question: does the model overtrain if I run the code multiple times with a different 'scoring' argument each time (see code)? I am using the cross_validate function, and although I have seen examples of computing multiple metrics (AUC, accuracy, etc.) in a single call, that code did not work for me for some reason. Is it okay to rerun the code block several times, changing only the scoring argument, to get the different values I need?
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_validate
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# First, setting a random seed for the code to be reproducible
random_seed = 200
np.random.seed(random_seed)
#Defining our variables, using our previously created dictionary
X = data_dict['data']
Y = data_dict['target']
groups = data_dict['participants']
# Creating the LeaveOneGroupOut cross-validator from scikit-learn
logo = LeaveOneGroupOut()
# Within each participant's group, shuffling the order of tasks
X_shuffled = []
Y_shuffled = []
groups_shuffled = []
unique_groups = np.unique(groups) #Each group representing a participant
for group in unique_groups:
    group_indices = np.where(groups == group)[0]
    shuffled_indices = np.random.permutation(group_indices)
    X_shuffled.extend(X[shuffled_indices])
    Y_shuffled.extend(Y[shuffled_indices])
    groups_shuffled.extend(groups[shuffled_indices])
X_shuffled = np.array(X_shuffled)
Y_shuffled = np.array(Y_shuffled)
groups_shuffled = np.array(groups_shuffled)
# Creating the GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=random_seed)
# The pipeline first imputes missing data using the mean of each feature, then scales/normalizes features,
# and then fits the gradient boosting classifier
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),  # Apply normalization
    ('clf', clf)
])
# Conducting the leave-one-group-out cross-validation on the shuffled data
results_logo = cross_validate(pipeline, X_shuffled, Y_shuffled, cv=logo.split(X_shuffled, Y_shuffled, groups_shuffled),
scoring='roc_auc', return_train_score=True, return_estimator=True)
print('auc')
print('training score: %.4f' % results_logo['train_score'].mean())
print('test score: %.4f' % results_logo['test_score'].mean())
print(results_logo['test_score'])
print(np.mean(results_logo['test_score']))
for i, (train_index, test_index) in enumerate(logo.split(X_shuffled, Y_shuffled, groups_shuffled)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}, group={groups_shuffled[train_index]}")
    print(f"  Test:  index={test_index}, group={groups_shuffled[test_index]}")
Upvotes: 0
Views: 520
Reputation: 124
Regarding your second question:
Does the model overtrain, etc. if I run it multiple times with different 'scoring' argument
No. A scoring metric is computed on an already fitted model, so you can calculate, say, both accuracy and recall from the same fitted model. Scoring does not affect the model's learned parameters, which were determined during fitting, so there is no problem with running separate calls, each computing a different metric.
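That said, cross_validate can also compute several metrics in one call by passing a list of scorer names to scoring; the result keys are then suffixed with the metric name. A minimal sketch, assuming pipeline, X_shuffled, Y_shuffled and groups_shuffled are defined as in your question:
from sklearn.model_selection import LeaveOneGroupOut, cross_validate

logo = LeaveOneGroupOut()
results = cross_validate(
    pipeline, X_shuffled, Y_shuffled,
    cv=logo.split(X_shuffled, Y_shuffled, groups_shuffled),
    scoring=['roc_auc', 'accuracy', 'f1'],
    return_train_score=True,
)

# With multiple scorers the result keys become test_<metric> / train_<metric>
print('test AUC:      %.4f' % results['test_roc_auc'].mean())
print('test accuracy: %.4f' % results['test_accuracy'].mean())
print('test F1:       %.4f' % results['test_f1'].mean())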
Upvotes: 0