Aditya Jain
Aditya Jain

Reputation: 129

How this method or formula for calculating ROC AUC works?

I was trying to calculate the AUC using MySQL for the data in table like below:

y   p
1   0.872637
0   0.130633
0   0.098054
...
...
1   0.060190
0   0.110938

I came across the following SQL query which is giving the correct AUC score (I verified using sklearn method).

 SELECT (sum(y*r) - 0.5*sum(y)*(sum(y)+1)) / (sum(y) * sum(1-y)) AS auc
 FROM ( 
   SELECT y, row_number() OVER (ORDER BY p) r
   FROM probs
 ) t

 Using pandas this can be done as follows:

 temp = df.sort_values(by="p")
 temp['r'] = np.arange(1, len(df)+1, 1)
 temp['yr'] = temp['y']*temp['r']
 print( (sum(temp.yr) - 0.5*sum(temp.y)*(sum(temp.y)+1)) / (sum(temp.y) * sum(1-temp.y)) )

I did not understand how we are able to calculate AUC using this method. Can somebody please give intuition behind this?

I am already familiar with the trapezoidal method which involves summing the area of small trapezoids under the ROC curve.

Upvotes: 3

Views: 1145

Answers (1)

Kate Melnykova
Kate Melnykova

Reputation: 1873

Short answer: it is Wilcoxon-Mann-Whitney statistic, see https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve The page has proof as well.

The bottom part of your formula is identical to the formula in the wiki. The top part is trickier. f in wiki corresponds to p in your data and t_0 and t_1 are indexes in the data frame. Note that we first sort by p, which makes our life easier.

Note that the double sum may be decomposed as

Sum_{t_1 such that y(t_1)=1} #{t_0 such that p(t_0) < p(t_1) and y(t_0)=0}

Here # stands for the total number of such indexes.

For each row index t_1 (such that y(t_1) =1), how many t_0 are such that p(t_0) < p(t_1) and y(t_0)=0? We know that there are exactly t_1 values of p that are less or equal than t_1 because values are sorted. We conclude that

#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = t_1 - #{t_0: t_0 <= t_1 and y(t_0)=1}

Now imagine scrolling down the sorted dataframe. For the first time we meet y=1, #{t_0: t_0 <= t_1 and y(t_0)=1}=1, for the second time we meet y=1, the same quantity is 2, for the third time we meet y=1, the quantity is 3, and so on. Therefore, when we sum the equality over all indexes t_1 when y=1, we get

Sum_{t_1: y(t_1)=1}#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = Sum_{t_1: y(t_1)=1} t_1 - (1 + 2 + 3 + ... + n),

where n is the total number of ones in y column. Now we need to do one more simplification. Note that

Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1)

If y(t_1) is not one, it is zero. Therefore,

Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1) = Sum_{t} t y(t)

Plugging this to our formula and using that

1 + 2+ 3 + ... + n = n(n+1)/2

finished the proof of the formula you found.

P.S. I think that posting this question on math or stats overflow would make more sense.

Upvotes: 2

Related Questions