Reputation: 129
I was trying to calculate the AUC using MySQL for the data in table like below:
y p
1 0.872637
0 0.130633
0 0.098054
...
...
1 0.060190
0 0.110938
I came across the following SQL query which is giving the correct AUC score (I verified using sklearn method).
SELECT (sum(y*r) - 0.5*sum(y)*(sum(y)+1)) / (sum(y) * sum(1-y)) AS auc
FROM (
SELECT y, row_number() OVER (ORDER BY p) r
FROM probs
) t
Using pandas this can be done as follows:
temp = df.sort_values(by="p")
temp['r'] = np.arange(1, len(df)+1, 1)
temp['yr'] = temp['y']*temp['r']
print( (sum(temp.yr) - 0.5*sum(temp.y)*(sum(temp.y)+1)) / (sum(temp.y) * sum(1-temp.y)) )
I did not understand how we are able to calculate AUC using this method. Can somebody please give intuition behind this?
I am already familiar with the trapezoidal method which involves summing the area of small trapezoids under the ROC curve.
Upvotes: 3
Views: 1145
Reputation: 1873
Short answer: it is Wilcoxon-Mann-Whitney statistic, see https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve The page has proof as well.
The bottom part of your formula is identical to the formula in the wiki. The top part is trickier. f
in wiki corresponds to p
in your data and t_0
and t_1
are indexes in the data frame. Note that we first sort by p
, which makes our life easier.
Note that the double sum may be decomposed as
Sum_{t_1 such that y(t_1)=1} #{t_0 such that p(t_0) < p(t_1) and y(t_0)=0}
Here #
stands for the total number of such indexes.
For each row index t_1
(such that y(t_1) =1
), how many t_0
are such that p(t_0) < p(t_1)
and y(t_0)=0
? We know that there are exactly t_1
values of p
that are less or equal than t_1
because values are sorted. We conclude that
#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = t_1 - #{t_0: t_0 <= t_1 and y(t_0)=1}
Now imagine scrolling down the sorted dataframe. For the first time we meet y=1
, #{t_0: t_0 <= t_1 and y(t_0)=1}=1
, for the second time we meet y=1
, the same quantity is 2, for the third time we meet y=1
, the quantity is 3, and so on. Therefore, when we sum the equality over all indexes t_1
when y=1
, we get
Sum_{t_1: y(t_1)=1}#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = Sum_{t_1: y(t_1)=1} t_1 - (1 + 2 + 3 + ... + n),
where n
is the total number of ones in y
column. Now we need to do one more simplification. Note that
Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1)
If y(t_1)
is not one, it is zero. Therefore,
Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1) = Sum_{t} t y(t)
Plugging this to our formula and using that
1 + 2+ 3 + ... + n = n(n+1)/2
finished the proof of the formula you found.
P.S. I think that posting this question on math or stats overflow would make more sense.
Upvotes: 2