Dominik

Reputation: 792

Issue with predict() in glmnetUtils

I'm trying to use the glmnetUtils package from GitHub as a formula interface to glmnet, but predict() returns fewer values than there are rows in the new data.

library(nycflights13) # from GitHub
library(modelr)
library(dplyr)
library(glmnet)
library(glmnetUtils)
library(purrr)


fitfun=function(dF){
  cv.glmnet(arr_delay~distance+air_time+dep_time,data=dF)
}
gnetr2=function(model,datavals){
  yvar=all.vars(formula(model)[[2]])
  print(paste('y variable:',yvar))
  print('observations')
  print(str(as.data.frame(datavals)[[yvar]]))
  print('predictions')
  print(str(predict(object=model,newdata=datavals)))
  stats::cor(stats::predict(object=model, newdata=datavals), as.data.frame(datavals)[[yvar]], use='complete.obs')^2
}


flights %>% 
  group_by(carrier) %>% 
  do({
    crossv_mc(.,4) %>% 
      mutate(mdl=map(train,fitfun),
             r2=map2_dbl(mdl,test,gnetr2))
  })

The output from gnetr2():

[1] "y variable: arr_delay"
[1] "observations"
 num [1:3693] -33 -6 47 4 15 -5 45 16 0 NA ...
NULL
[1] "predictions"
 num [1:3476, 1] 8.22 21.75 24.31 -7.96 -7.27 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:3476] "1" "2" "3" "4" ...
  ..$ : chr "1"
NULL
Error: incompatible dimensions

Any ideas what's going on? Your help is much appreciated!

Upvotes: 0

Views: 506

Answers (2)

Dominik

Reputation: 792

It turns out this happens because there are NAs in the predictor variables, so predict() returns a shorter vector, since na.action=na.exclude.
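A minimal base-R sketch of the mechanism, assuming the formula interface builds its design matrix through model.frame() (an assumption about the internals, not something confirmed in this thread). The toy data frame and its column names are made up for illustration.

```r
# Toy data: one predictor value is missing (values are invented)
df <- data.frame(y = c(1, 2, 3, 4), x = c(10, NA, 30, 40))

# Dropping incomplete rows when the model frame is built
mf_omit <- model.frame(y ~ x, data = df, na.action = na.omit)

# Keeping incomplete rows, with NA carried through
mf_pass <- model.frame(y ~ x, data = df, na.action = na.pass)

x_omit <- model.matrix(y ~ x, mf_omit)
x_pass <- model.matrix(y ~ x, mf_pass)

nrow(x_omit)  # 3 rows: the row with the missing x is gone
nrow(x_pass)  # 4 rows: same length as df, row 2 holds NA
```

Any prediction computed from the shorter matrix has one value fewer than the observed response, which is the kind of mismatch that makes cor() fail with "incompatible dimensions".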

Normally the solution would be to call predict(object,newdata,na.action=na.pass), but predict.cv.glmnet does not accept that extra argument.

Therefore the solution is to filter for complete cases before fitting:

flights=flights %>% filter(complete.cases(.))
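One caveat with filtering on every column: rows that are missing values only in columns the model never uses get dropped too. A hedged base-R sketch of a narrower filter follows. The data frame is a made-up stand-in for flights, and model_vars is a hypothetical name introduced here for the columns used in the formula.

```r
# Toy stand-in for flights (values are invented)
df <- data.frame(
  arr_delay = c(5, NA, 12, -3, 8),
  distance  = c(200, 300, NA, 500, 600),
  air_time  = c(40, 55, 70, 90, 100),
  dep_time  = c(600, 700, 800, 900, 1000),
  tailnum   = c("A", NA, "C", "D", NA)  # not used in the model
)

# Hypothetical helper vector: only the columns that appear in the formula
model_vars <- c("arr_delay", "distance", "air_time", "dep_time")

# Filter on every column, as in the line above
all_cols <- df[complete.cases(df), ]

# Filter only on the columns the model needs
model_only <- df[complete.cases(df[, model_vars]), ]

nrow(all_cols)    # 2
nrow(model_only)  # 3: row 5 is kept, since only tailnum is missing there
```

Either way, predictions and observations then have the same length, so the cor() call in gnetr2() no longer sees mismatched dimensions.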

Upvotes: 0

Hong Ooi

Reputation: 57696

This is an issue with the underlying glmnet package, but there's no reason that it can't be handled in glmnetUtils. I've just pushed an update that should let you use the na.action argument with the predict method for formula-based calls.

  • Setting na.action=na.pass (the default) will pad out the predictions to include NAs for rows with missing values
  • na.action=na.omit or na.exclude will drop these rows

Note that the missingness of a given row may change depending on how much regularisation is done: if the NAs are for variables that get dropped from the model, then the row will be counted as being a complete case.
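To illustrate what "padding out" means, here is a rough sketch that does it by hand. It is not the glmnetUtils implementation: pad_predictions is a hypothetical helper, and lm() stands in for the fitted model only so the example runs without glmnet installed.

```r
# Hypothetical helper, not part of glmnet or glmnetUtils:
# predict on complete rows only, then write the results back into
# an NA vector as long as newdata.
pad_predictions <- function(predict_fun, newdata, predictors) {
  ok  <- complete.cases(newdata[, predictors, drop = FALSE])
  out <- rep(NA_real_, nrow(newdata))
  out[ok] <- as.numeric(predict_fun(newdata[ok, , drop = FALSE]))
  out
}

# Stand-in model and data (values are invented)
train   <- data.frame(y = c(1, 2, 3, 4, 5), x = c(1, 2, 3, 4, 5))
fit     <- lm(y ~ x, data = train)
test_df <- data.frame(y = c(2, 4, 6), x = c(2, NA, 6))

preds <- pad_predictions(function(d) predict(fit, newdata = d), test_df, "x")

length(preds)  # same length as nrow(test_df)
is.na(preds)   # TRUE only for the row with the missing predictor
```

Because preds lines up row for row with test_df$y, a call such as cor(preds, test_df$y, use = "complete.obs") can skip the NA rows instead of failing on mismatched lengths.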

Also took the opportunity to fix a bug where the LHS of the formula contains an expression.

Give it a go with install_github("Hong-Revo/glmnetUtils") and tell me if anything breaks.

Upvotes: 1
