Reputation: 792
trying to use the glmnetUtils
package from GitHub for formula interface to glmnet but predict is not estimating enough values
library(nycflights13) # from GitHub
library(modelr)
library(dplyr)
library(glmnet)
library(glmnetUtils)
library(purrr)
fitfun=function(dF){
cv.glmnet(arr_delay~distance+air_time+dep_time,data=dF)
}
gnetr2=function(model,datavals){
yvar=all.vars(formula(model)[[2]])
print(paste('y variable:',yvar))
print('observations')
print(str(as.data.frame(datavals)[[yvar]]))
print('predictions')
print(str(predict(object=model,newdata=datavals)))
stats::cor(stats::predict(object=model, newdata=datavals), as.data.frame(datavals)[[yvar]], use='complete.obs')^2
}
flights %>%
group_by(carrier) %>%
do({
crossv_mc(.,4) %>%
mutate(mdl=map(train,fitfun),
r2=map2_dbl(mdl,test,gnetr2))
})
the output from gnetr2():
[1] "y variable: arr_delay"
[1] "observations"
num [1:3693] -33 -6 47 4 15 -5 45 16 0 NA ...
NULL
[1] "predictions"
num [1:3476, 1] 8.22 21.75 24.31 -7.96 -7.27 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:3476] "1" "2" "3" "4" ...
..$ : chr "1"
NULL
Error: incompatible dimensions
any ideas what's going on? your help is much appreciated!
Upvotes: 0
Views: 506
Reputation: 792
Turns out its happening because there are NA in the predictor variables so predict()
results in a shorter vector since na.action=na.exclude
.
Normally a solution would be to use predict(object,newdata,na.action=na.pass)
but predict.cv.glmnet
does not accept other arguments to predict
.
Therefore the solution is to filter for complete cases before beginning
flights=flights %>% filter(complete.cases(.))
Upvotes: 0
Reputation: 57696
This is an issue with the underlying glmnet package, but there's no reason that it can't be handled in glmnetUtils. I've just pushed an update that should let you use the na.action
argument with the predict
method for formula-based calls.
na.action=na.pass
(the default) will pad out the predictions to include NAs for rows with missing valuesna.action=na.omit
or na.exclude
will drop these rowsNote that the missingness of a given row may change depending on how much regularisation is done: if the NAs are for variables that get dropped from the model, then the row will be counted as being a complete case.
Also took the opportunity to fix a bug where the LHS of the formula contains an expression.
Give it a go with install_github("Hong-Revo/glmnetUtils")
and tell me if anything breaks.
Upvotes: 1