Mor

Reputation: 11

How to create a combined ROC curve for two classifiers and two different datasets

I have a dataset of 1127 patients. My goal is to classify each patient as 0 or 1. I have two different classifiers, both with the same purpose: to classify a patient as 0 or 1. I ran the first classifier on 364 patients and the second classifier on the remaining 763 patients. For each classifier/group, I generated a ROC curve. Now I would like to combine the curves. Could someone guide me on how to do it? I'm thinking of calculating weighted FPR and TPR values, but I'm not sure how to do it. The number of FPR/TPR pairs differs between the curves (the first ROC curve is based on 312 pairs and the second on 666 pairs).
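To make the idea concrete, the kind of weighted combination I have in mind is roughly the sketch below (here y1/scores1 and y2/scores2 are placeholder names for each group's true labels and classifier outputs; I would interpolate both curves onto a common FPR grid and weight the TPRs by group size), but I'm not sure this is the right approach:

import numpy as np
from sklearn.metrics import roc_curve

# Placeholder names: y1/scores1 and y2/scores2 are the true labels and
# classifier outputs for the 364-patient and 763-patient groups.
fpr1, tpr1, _ = roc_curve(y1, scores1)
fpr2, tpr2, _ = roc_curve(y2, scores2)

# Interpolate both curves onto a common FPR grid, then take the
# group-size-weighted average of the TPR values.
fpr_grid = np.linspace(0.0, 1.0, 101)
tpr1_i = np.interp(fpr_grid, fpr1, tpr1)
tpr2_i = np.interp(fpr_grid, fpr2, tpr2)
w1, w2 = 364 / 1127, 763 / 1127
tpr_combined = w1 * tpr1_i + w2 * tpr2_i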

Thanks!!!

Upvotes: 0

Views: 610

Answers (1)

Juan Kania-Morales

Reputation: 588

Imports

import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

Data generation

# simulate first dataset with 364 obs
df1 = pd.DataFrame(range(364))
df1['predict_proba_1'] = np.random.normal(0, 1, len(df1))
df1['epsilon'] = np.random.normal(0, 1, len(df1))
df1['true'] = (0.7 * df1['epsilon'] < df1['predict_proba_1']) * 1
df1 = df1.drop(columns=[0, 'epsilon'])

# simulate second dataset with 763 obs
df2 = pd.DataFrame(range(763))
df2['predict_proba_2'] = np.random.normal(0, 1, len(df2))
df2['epsilon'] = np.random.normal(0, 1, len(df2))
df2['true'] = (0.7 * df2['epsilon'] < df2['predict_proba_2']) * 1
df2 = df2.drop(columns=[0, 'epsilon'])

Quick look at generated data

df1
     predict_proba_1  true
0           1.234549     1
1          -0.586544     0
2          -0.229539     1
3           0.132185     1
4          -0.411284     0
..               ...   ...
359        -0.218775     0
360        -0.985565     0
361         0.542790     1
362        -0.463667     0
363         1.119244     1

[364 rows x 2 columns]

df2
     predict_proba_2  true
0           0.278755     1
1           0.653663     0
2          -0.304216     1
3           0.955658     1
4          -1.341669     0
..               ...   ...
758         1.359606     1
759        -0.605894     0
760         0.379738     0
761         1.571615     1
762        -1.102565     0

[763 rows x 2 columns]

Necessary functions

def show_ROCs(scores_list: list, ys_list: list, labels_list: list = None):
    """
    This function plots several ROC curves on one figure. Corresponding labels are optional.

    Parameters
    ----------
    scores_list : list of array-likes with scores or predicted probabilities.
    ys_list : list of array-likes with ground-truth labels.
    labels_list : list of labels to be displayed in the plotted graph.

    Returns
    ----------
    None

    """
    if len(scores_list) != len(ys_list):
        raise ValueError('len(scores_list) != len(ys_list)')
    fpr_dict = dict()
    tpr_dict = dict()
    for x in range(len(scores_list)):
        fpr_dict[x], tpr_dict[x], _ = roc_curve(ys_list[x], scores_list[x])
    for x in range(len(scores_list)):
        # fall back to the curve index when no label is provided
        label = str(labels_list[x]) if labels_list is not None else str(x)
        plot_ROC(fpr_dict[x], tpr_dict[x],
                 label + ' AUC: ' + str(round(auc(fpr_dict[x], tpr_dict[x]), 3)))
    plt.show()

def plot_ROC(fpr, tpr, label):
    """
    This function plots a single ROC curve with the given legend label.

    Parameters
    ----------
    fpr : array-like with false positive rates.
    tpr : array-like with true positive rates.
    label : label to be displayed in the plot legend.

    Returns
    ----------
    None

    """
    plt.figure(1)                    # reuse the same figure so curves overlay
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal reference line (random classifier)
    plt.plot(fpr, tpr, label=label)
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(loc='best')

Plotting

show_ROCs(
    [df1['predict_proba_1'], df2['predict_proba_2']],
    [df1['true'], df2['true']],
    ['df1 with {} obs'.format(len(df1)), 'df2 with {} obs'.format(len(df2))]
)

The Image You Want

(resulting figure: both ROC curves, each labeled with its dataset size and AUC, on a single plot)
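If a single combined curve is preferred over two overlaid ones, a minimal sketch (assuming the two classifiers' scores are on a comparable scale, e.g. predicted probabilities) is to pool the scores and labels from both groups and compute one ROC over all 1127 observations:

# Minimal sketch: pool scores and labels from both groups and compute a
# single ROC curve over all observations. This assumes the two classifiers
# output scores on a comparable scale (e.g. predicted probabilities).
pooled_scores = np.concatenate([df1['predict_proba_1'], df2['predict_proba_2']])
pooled_true = np.concatenate([df1['true'], df2['true']])
fpr_p, tpr_p, _ = roc_curve(pooled_true, pooled_scores)
plot_ROC(fpr_p, tpr_p, 'pooled AUC: ' + str(round(auc(fpr_p, tpr_p), 3)))
plt.show()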

Upvotes: 0
