Reputation: 49
I have a dataset that looks like this:
ID   Class   Predicted Probabilities
1    1       0.592
2    1       0.624
3    0       0.544
4    0       0.194
5    0       0.328
6    1       0.504
.    .       .
.    .       .
I have been tasked with calculating the AUC manually, but I'm not sure how.
I know how to calculate the TPR and FPR to create a ROC curve. How would I be able to use the data to calculate the AUC? No libraries like scikit-learn allowed. I've looked everywhere but can't seem to find a proper answer. Thanks, everyone!
Upvotes: 2
Views: 4823
Reputation: 517
You'll need to calculate the true positive and false positive rates using your predicted and true class while varying your class threshold (T), i.e. the cut-off you use to predict whether an observation falls into class 0 or 1.
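As a rough sketch of that threshold sweep in plain Python: the function name `roc_points` and the rule "predict class 1 when the probability is >= T" are my own assumptions, and the sample data is just the six rows visible in the question, not the full dataset.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0).

    Assumes labels are 0/1 and that an observation is predicted as class 1
    when its score is >= the threshold T.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted as 1
    # Sweep T from the highest score down to the lowest.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# The six rows shown in the question (the real dataset has more).
labels = [1, 1, 0, 0, 0, 1]
scores = [0.592, 0.624, 0.544, 0.194, 0.328, 0.504]
points = roc_points(labels, scores)
```

Each tuple is one point on the ROC curve, ordered from (0, 0) to (1, 1). Recounting TP and FP for every threshold is quadratic, which is fine for small data; sorting once and accumulating counts would be faster.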
You'll need a dataset with a header that looks like...
ID, Predicted Probability, Predicted Class, True Class, Threshold, True Positive Flag, False Positive Flag
(see https://en.wikipedia.org/wiki/Receiver_operating_characteristic for details). If you look at the Wikipedia page you'll notice it even provides a quick and easy discrete estimate in the "Area under curve" section.
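One common discrete estimate treats AUC as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as half. Here's a minimal sketch of that idea; the function name `auc_pairwise` is hypothetical and the data is again only the six rows shown in the question.

```python
def auc_pairwise(labels, scores):
    """Estimate AUC as the fraction of (positive, negative) pairs in which
    the positive has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# The six rows shown in the question (the real dataset has more).
labels = [1, 1, 0, 0, 0, 1]
scores = [0.592, 0.624, 0.544, 0.194, 0.328, 0.504]
print(round(auc_pairwise(labels, scores), 3))  # 0.889 (8 of 9 pairs ranked correctly)
```

This compares every positive with every negative, so it gets slow on large datasets, but it makes a handy cross-check for whatever you get from integrating the ROC curve.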
AUC stands for "area under the curve" so you'll likely need to perform some sort of numerical integration. In this context, TPR will be your Y and FPR your X at each value of T.
You could try and use something like the trapezoidal rule (https://en.wikipedia.org/wiki/Trapezoidal_rule) if you wanna keep it simple.
You can use numpy.trapz (see: https://docs.scipy.org/doc/numpy/reference/generated/numpy.trapz.html; as of NumPy 2.0 it is deprecated in favor of numpy.trapezoid) if you don't want to implement this yourself, but it's not difficult to build from scratch either (see: Trapezoidal rule in Python).
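A from-scratch version could look something like this. The name `trapezoid_area` is my own, and the FPR/TPR lists are hand-computed from the six rows visible in the question (predicting class 1 when the probability is >= T), so treat the number as illustrative only.

```python
def trapezoid_area(xs, ys):
    """Trapezoidal rule: sum the area of the trapezoid between each pair of
    consecutive points. xs must be sorted in non-decreasing order."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    area = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        mean_height = (ys[i] + ys[i - 1]) / 2.0
        area += width * mean_height
    return area


# ROC points for the six sample rows, from (0, 0) to (1, 1).
fpr = [0.0, 0.0, 0.0, 1 / 3, 1 / 3, 2 / 3, 1.0]
tpr = [0.0, 1 / 3, 2 / 3, 2 / 3, 1.0, 1.0, 1.0]

auc = trapezoid_area(fpr, tpr)
print(round(auc, 3))  # 0.889
```

Vertical segments (where FPR doesn't change) have zero width and contribute nothing, so it doesn't matter that several points share the same X.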
You should be able to write functions for these in Python using only math and numpy pretty easily. In fact, you might not need any libraries at all.
Upvotes: 1