Reputation: 23
I have to implement a classification algorithm on a medical dataset, so I thought it was crucial to have good recall on disease recognition. I wanted to implement a scorer like this:
recall_scorer = make_scorer(recall_score(y_true= , y_pred= ,
                                         labels=['compensated_hypothyroid', 'primary_hypothyroid'],
                                         average='macro'))
But then I would like to use this scorer in GridSearchCV, so that it fits on KFold splits for me. That is where I am stuck: I don't know how to initialize the scorer, since it expects y_true and y_pred to be passed immediately.
How do I get around this problem? Do I have to write my own hyperparameter tuning?
Upvotes: 2
Views: 1650
Reputation: 14062
As per your comment, calculating the recall for only two classes during the Cross-Validation iterations is doable in Scikit-learn.
Consider this dataset example:
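Something along these lines for dataset_example.csv (the values are invented for illustration; the columns and class labels match the loading code below):

TSH,T4,diagnosis
1.30,83,normal
7.10,61,compensated_hypothyroid
15.00,37,primary_hypothyroid
0.03,195,hyperthyroid
2.10,105,normal
...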
You can use the make_scorer function to grab y_true and y_pred during the Cross-Validation iterations, as follows:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
import numpy as np
def getDataset(path, x_attr, y_attr, mapping):
    """
    Extract the dataset from a CSV file.
    :param path: location of the CSV file
    :param x_attr: list of feature column names
    :param y_attr: name of the target column in the CSV file
    :param mapping: dictionary mapping class names to integers
    :return: tuple (X, Y)
    """
    df = pd.read_csv(path)
    df.replace(mapping, inplace=True)  # encode the class names as integers
    X = np.array(df[x_attr]).reshape(len(df), len(x_attr))
    Y = np.array(df[y_attr])
    return X, Y
def custom_recall_score(y_true, y_pred):
    """
    Workaround for the recall score.
    :param y_true: ground truth of the current iteration
    :param y_pred: predictions of the current iteration
    :return: float, recall
    """
    wanted_labels = [0, 1]  # compensated_hypothyroid and primary_hypothyroid
    # both wanted classes must appear in this fold's ground truth
    assert set(wanted_labels).issubset(y_true)
    # index of the first occurrence of each wanted label
    wanted_indices = [y_true.tolist().index(x) for x in wanted_labels]
    wanted_y_true = [y_true[x] for x in wanted_indices]
    wanted_y_pred = [y_pred[x] for x in wanted_indices]
    recall_ = recall_score(wanted_y_true, wanted_y_pred,
                           labels=wanted_labels, average='macro')
    print("Wanted Indices: {}".format(wanted_indices))
    print("Wanted y_true: {}".format(wanted_y_true))
    print("Wanted y_pred: {}".format(wanted_y_pred))
    print("Recall during cross validation: {}".format(recall_))
    return recall_
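# A hypothetical toy check of custom_recall_score (not part of the original run):
# it picks the first occurrence of label 0 (index 2) and of label 1 (index 4)
# and scores recall on just those two samples.
y_true_demo = np.array([2, 3, 0, 3, 1, 3])
y_pred_demo = np.array([0, 3, 0, 3, 3, 3])
custom_recall_score(y_true_demo, y_pred_demo)  # recall 0.5: class 0 recovered, class 1 missed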
def run(X_data, Y_data):
    # hold out a stratified test set
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_index, test_index = next(sss.split(X_data, Y_data))
    X_train, X_test = X_data[train_index], X_data[test_index]
    Y_train, Y_test = Y_data[train_index], Y_data[test_index]
    param_grid = {'C': [0.1, 1]}  # or whatever parameters you want
    # LogisticRegression is used just as an example
    model = LogisticRegression(solver='saga', random_state=0)
    clf = GridSearchCV(model, param_grid,
                       cv=StratifiedKFold(n_splits=2),
                       return_train_score=True,
                       scoring=make_scorer(custom_recall_score))
    clf.fit(X_train, Y_train)
    print(clf.cv_results_)
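    # the fitted search also exposes the usual GridSearchCV attributes,
    # e.g. clf.best_params_ and clf.best_score_, if you just need the winner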
X_data, Y_data = getDataset("dataset_example.csv", ['TSH', 'T4'], 'diagnosis',
                            {'compensated_hypothyroid': 0, 'primary_hypothyroid': 1,
                             'hyperthyroid': 2, 'normal': 3})
run(X_data, Y_data)
Output sample:
Wanted Indices: [3, 5]
Wanted y_true: [0, 1]
Wanted y_pred: [3, 3]
Recall during cross validation: 0.0
...
...
Wanted Indices: [0, 4]
Wanted y_true: [0, 1]
Wanted y_pred: [1, 1]
Recall during cross validation: 0.5
...
...
{'param_C': masked_array(data=[0.1, 1], mask=[False, False],
fill_value='?', dtype=object),
'mean_score_time': array([0.00094521, 0.00086224]),
'mean_fit_time': array([0.00298035, 0.0023526 ]),
'std_score_time': array([7.02142715e-05, 1.78813934e-06]),
'mean_test_score': array([0.21428571, 0.5 ]),
'std_test_score': array([0.24743583, 0. ]),
'params': [{'C': 0.1}, {'C': 1}],
'mean_train_score': array([0.25, 0.5 ]),
'std_train_score': array([0.25, 0. ]),
....
....}
You must use StratifiedShuffleSplit and StratifiedKFold, and your dataset's classes should be reasonably balanced, so that every iteration sees a stratified distribution of the classes; otherwise the assertion above may fail!
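As a side note: make_scorer forwards extra keyword arguments to the metric function, so if what you want is the macro recall over those two (integer-encoded) classes computed from all of their samples in each fold, rather than just the first occurrence of each, a plain scorer like this should also work:

# make_scorer passes extra keyword arguments through to recall_score,
# restricting the macro average to the two encoded labels
recall_scorer = make_scorer(recall_score, labels=[0, 1], average='macro')
# then pass it directly: GridSearchCV(model, param_grid, scoring=recall_scorer, ...)

This also avoids the assertion, since recall_score simply restricts the averaging to the given labels.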
Upvotes: 3