Isaac

Reputation: 344

Matthews Correlation Coefficient yielding values outside of [-1,1]

I'm using the formula found on Wikipedia to calculate the Matthews correlation coefficient (MCC). It works most of the time, but my tool's implementation sometimes produces out-of-range values, and I can't see where the problem is.

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Here TP, TN, FP, and FN are the non-negative integer counts of true positives, true negatives, false positives, and false negatives. The formula should only return values in $[-1,1]$. My implementation is as follows:

double ret;
if ((TruePositives + FalsePositives) == 0 || (TruePositives + FalseNegatives) == 0 ||
    (TrueNegatives + FalsePositives) == 0 || (TrueNegatives + FalseNegatives) == 0)
{
    // To avoid dividing by zero
    ret = (double)(TruePositives * TrueNegatives -
                   FalsePositives * FalseNegatives);
}
else
{
    double num = (double)(TruePositives * TrueNegatives -
                          FalsePositives * FalseNegatives);

    double denom = (TruePositives + FalsePositives) *
                   (TruePositives + FalseNegatives) *
                   (TrueNegatives + FalsePositives) *
                   (TrueNegatives + FalseNegatives);
    denom = Math.Sqrt(denom);
    ret = num / denom;
}
return ret;

As I said, this works properly most of the time. But if, for instance, TP = 280, TN = 273, FP = 67, and FN = 20, then my code gives:

MCC = ((280*273)-(67*20))/sqrt(347*300*340*293) = 75100/42196.06 ≈ 1.78

Is this normal behavior for the Matthews correlation coefficient? I'm a programmer by trade, so statistics isn't part of my formal training. I've also looked at other questions with answers, and none of them discuss this behavior. Is it a bug in my code or in the formula itself?

Upvotes: 2

Views: 665

Answers (1)

whuber

Reputation: 2494

The code is clear and looks correct. (But one's eyes can always deceive.)

One concern is whether the output is guaranteed to lie between -1 and 1. Assuming all inputs are nonnegative, though, we can round the numerator up and the denominator down, thereby overestimating the result, by zeroing out all the "False*" terms, producing

TP*TN / Sqrt(TP*TN*TP*TN) = 1.

The lower limit is obtained similarly by zeroing out all the "True*" terms. Therefore, working code cannot produce a value larger than 1 in absolute value unless it is presented with invalid input.
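As a sketch of the two limiting cases in the question's notation, with the question's example counts evaluated exactly for comparison (the arithmetic is added here for illustration only):

$$
\begin{aligned}
FP = FN = 0:&\quad \mathrm{MCC} = \frac{TP \cdot TN}{\sqrt{TP \cdot TP \cdot TN \cdot TN}} = 1,\\
TP = TN = 0:&\quad \mathrm{MCC} = \frac{-FP \cdot FN}{\sqrt{FP \cdot FN \cdot FP \cdot FN}} = -1,\\
(TP, TN, FP, FN) = (280, 273, 67, 20):&\quad \mathrm{MCC} = \frac{75100}{\sqrt{10\,370\,442\,000}} \approx \frac{75100}{101835.4} \approx 0.737.
\end{aligned}
$$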

I therefore recommend placing a guard (such as an Assert statement) to ensure the inputs are nonnegative. (Clearly it does not matter in the preceding argument whether they are integral.) Place another assertion to check that the output is in the interval [-1,1]. Together, these will detect (a) invalid inputs, (b) an error in the calculation, or both.
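A minimal sketch of both guards in C#, under some assumptions that are not stated in the question: the counts are passed as `long`, each factor is converted to `double` before multiplying, and a zero denominator is mapped to 0 by convention. The class name `MccSketch` is hypothetical. `Debug.Assert` lives in `System.Diagnostics` and is compiled out of release builds, so a thrown exception may be preferable in production code.

```csharp
using System;
using System.Diagnostics;

public static class MccSketch
{
    // Hypothetical helper, not the asker's method.
    public static double Mcc(long tp, long tn, long fp, long fn)
    {
        // Guard (a): every count must be nonnegative.
        Debug.Assert(tp >= 0 && tn >= 0 && fp >= 0 && fn >= 0,
                     "counts must be nonnegative");

        // Multiply in double so the products cannot wrap around.
        double num = (double)tp * tn - (double)fp * fn;
        double denom = Math.Sqrt((double)(tp + fp) * (tp + fn) *
                                 (tn + fp) * (tn + fn));

        // A zero marginal also forces the numerator to zero;
        // returning 0 in that case is an assumed convention.
        double ret = denom == 0.0 ? 0.0 : num / denom;

        // Guard (b): the result must lie in [-1, 1], up to rounding.
        Debug.Assert(Math.Abs(ret) <= 1.0 + 1e-12, "MCC outside [-1, 1]");
        return ret;
    }

    public static void Main()
    {
        // The question's example counts.
        Console.WriteLine(Mcc(280, 273, 67, 20));

        // The same four-way product in 32-bit int arithmetic wraps around.
        int a = 347, b = 300, c = 340, d = 293;
        int wrapped = unchecked(a * b * c * d);
        Console.WriteLine(wrapped);            // 1780507408
        Console.WriteLine(Math.Sqrt(wrapped)); // about 42196.06
    }
}
```

If the original counts are 32-bit `int` values (an assumption), the output assertion would catch the question's example: the true product 347*300*340*293 = 10,370,442,000 exceeds `int.MaxValue`, and the wrapped value has a square root of about 42196.06, which matches the denominator reported in the question.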

Upvotes: 3
