Mark Nh
Mark Nh

Reputation: 33

Does cv.glmnet in R return double MSE for binary data?

I noticed that the minimized values for the MSE (mean of squared errors) and MAE (mean of absolute values of the errors) criteria returned by "$cvm" or "plot()" in cv.glmnet are twice the actual values when the outcome is binary. This is very confusing. There must be some reason for this. Is this explained somewhere or do I understand something wrong?

I published a figure with MAE-values larger than 0.5 on the y-axis (binary outcome) produced by a code similar to what I posted below. This actually implies that my predictions would be worse than random guessing. So I got a complaint from a referee which I could not answer. And when I calculate MAE from the predicted values, it seems to be correct and it is exactly 50% of the value what cv.glmnet gives. I find this discrepancy extremely irritating.

# sample R code to illustrate huge discrepancy between MSE and MAE
# values from cv.glmnet when compared to prediction errors
# calculated applying a test data set 

# generate binomial data
n <- 10000
set.seed(1234)
x1 <- runif(n, -2, 2); x2 <- runif(n, -2, 2)
p <- exp(x1 + x2)/(1 + exp(x1 + x2))
y <- rbinom(n, 1, p)
  # Note: if the line above is replaced by  y <- 5*p + rnorm(n)
  #  and later family="gaussian" in cv.glmnet() 
  #  then there are no similar discrepancies as with binary data

# predictors
x <- matrix(0, n, 10)
for (k in 1:ncol(x)) {a <- runif(1) 
 x[, k] <- 0.5*(a*x1 + (1-a)*x2 + runif(n, -2, 2))}

# training data
xa <- x[seq(n/2), ]; ya <- y[seq(n/2)]
# test data
xb <- x[-1*seq(n/2), ]; yb <- y[-1*seq(n/2)]

install.packages("glmnet")
library(glmnet)
# set alpha=1 i.e. apply Lasso regression with optimal regularization
# parameter lambda chosen by MSE or MAE criterion
cvfit_MSE <- cv.glmnet(xa, ya, family="binomial", alpha=1, 
                       type.measure="mse", nfolds=10)
cvfit_MAE <- cv.glmnet(xa, ya, family="binomial", alpha=1, 
                       type.measure="mae", nfolds=10)

min(cvfit_MSE$cvm) # MSE=0.3734 (mean of squared errors)
min(cvfit_MAE$cvm) # MAE=0.7477 (mean of absolute values of the errors)
plot(cvfit_MSE) 
           # also this has min 0.3734 on y-axis (labelled as "MSE")
plot(cvfit_MAE) # here y-axis (labelled as MAE) starts at 0.75 
                # but even random guessing would produce MAE about 0.5, 
                # extremely confusing... should y-axis label be 2*MAE??

# obtain predictions for the test data:
pp <- predict(cvfit_MSE, newx=xb, s="lambda.min", type="response")

# compare these to the actual observed values (variable "yb") in test data.
# surprisingly, MSE and MAE from these predictions are half of those from cv.glmnet:
mean((pp-yb)^2) # MSE=0.17650 (mean of squared errors)
mean(abs(pp-yb)) # MAE=0.36380 (mean of absolute values of the errors)

Upvotes: 2

Views: 334

Answers (2)

Mark Nh
Mark Nh

Reputation: 33

This is just to add a few lines to the excellent answer by @benbolker:

Apparently, for categorical outcomes, Brier score is (one of) the preferred measures for predictive accuracy. It is defined as mean squared difference between predicted group probabilities and actual group membership (1 for the correct group, 0 for all other groups) taking into account the "prediction errors" for all groups, not just the correct group. Of note, for k>2 (number of categories, k=2 for binary) this differs from such "version" of MSE, which would take only the correct group into account ignoring the predicted probabilities for the other groups.

But for k=2, Brier score is just twice the usual MSE. Wikipedia notes that usually, when k=2, Brier score is defined as half of the score originally proposed by Brier. So, apparently, as glmnet is using the same code for binary and other categorical responses, they apply the original formula of Brier score and call that MSE and also apply the same analogue for MAE. This is confusing because this is not the usual definition of MSE or MAE when there are only 2 groups.

Upvotes: 1

Ben Bolker
Ben Bolker

Reputation: 226911

I don't know where this is justified/described, but I can point you to the actual code, which is in glmnet:::cv.lognet (lognet is the subclass of glmnet model that applies to fitting binomial data).

The formula used is

(y[, 1] - (1 - predmat))^2 + (y[, 2] - predmat)^2

which effectively treats a binary outcome as a (0,1) or (1,0) pair and computes the error for both the "failures" (0 outcomes) and "successes" (1 outcomes).

For example:

predmat <- 0.2
y <- matrix(c(0, 1), nrow = 1)
mse = (y[, 1] - (1 - predmat))^2 + 
    (y[, 2] - predmat)^2

which gives 2*0.8^2 = 1.28 rather than 0.8^2 = 0.64 exactly as you point out.

Clearly for binary data the two parts of this expression will be equivalent. It's probably written this way to generalize more easily to binomial data with N>1 (in R these data are typically represented as a two-column matrix of failures and successes).

I would guess that the authors only cared that the MSE be proportionally correct, so that it would be useful for picking the right model ... I guess they could have written it as

(y[,2]/rowSums(y) - predmat)^2

instead.

Upvotes: 1

Related Questions