Kasia Danilczuk


Predict function error for probabilities in glmnet?

I am trying to predict probabilities in a dataset using glmnet. My code reads:

bank <- read.table("http://www.stat.columbia.edu/~madigan/W2025/data/BankSortedMissing.TXT", header=TRUE)
bank$rich <- sample(c(0:1), 233, replace=TRUE)
train = bank[1:200,]
test = bank[201:233,]
x = model.matrix(rich~., bank)[,-1]
cv.out = cv.glmnet(x, train$rich, alpha=0, family="binomial")
ridge.mod = glmnet(x, train$rich, alpha=0, family="binomial")
bank$rich <- NULL
newx = data.matrix(test$rich)
ridge.pred = predict(ridge.mod, newx=newx)

train = data[1:2500,]
test = data[2501:5088,]
x = model.matrix(Y~x1+x2+x3+x4+x5+x6, data)[,-1]
cv.out = cv.glmnet(x, data$Y, alpha=0, family="binomial")
bestlam = cv.out$lambda.min
ridge.mod = glmnet(x, data$Y, alpha=0, family="binomial")
test$Y <- NULL
newx = data.matrix(test)
ridge.pred = predict(ridge.mod, newx=newx, type="response")

I keep getting this error message when using predict:

Error in as.matrix(cbind2(1, newx) %*% nbeta) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error in t(.Call(Csparse_dense_crossprod, y, t(x))) : error in evaluating the argument 'x' in selecting a method for function 't': Error: Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90

I've tried this on the "Hitters" dataset and it works perfectly fine.

library(ISLR);
library(glmnet)
Hitters=na.omit(Hitters)

Hitters$Rich<-ifelse(Hitters$Salary>500,1,0)
Hitters.train = Hitters[1:200,]
Hitters.test = Hitters[201:dim(Hitters)[1],]
x=model.matrix(Rich~.,Hitters)[,-1]
cv.out=cv.glmnet(x, Hitters$Rich, alpha=0, family="binomial")
bestlam=cv.out$lambda.min
ridge.mod=glmnet(x, Hitters$Rich, alpha=0,lambda=bestlam, family="binomial")
Hitters.test$Rich <- NULL
newx = data.matrix(Hitters.test)
ridge.pred=predict(ridge.mod,newx=newx, type="response")
head(ridge.pred)
ridge.pred[1:10,]

Does anyone know how I can fix this?

Upvotes: 9

Views: 24238

Answers (7)

Spyros

Reputation: 41

I got the same error because the training and testing datasets ended up with different dimensions: their categorical columns produced different sets of dummy variables. The cause was that the categorical columns were stored as character columns rather than factors. I converted those columns to factors before splitting the data into training and testing sets, and it worked!

data$factor_column_a <- as.factor(data$factor_column_a)
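
A minimal sketch of why this helps, using base R only and made-up data (the data frame `df`, the column `colour` and the split indices are hypothetical, not from the question). With a character column, model.matrix derives the levels from whichever rows it is given, so the two splits can yield different columns. Converting to a factor before splitting keeps the full level set in both halves.

```r
# Hypothetical toy data: one response and one categorical predictor
df <- data.frame(
  y = c(0, 1, 0, 1, 0, 1),
  colour = c("red", "blue", "green", "blue", "red", "blue"),
  stringsAsFactors = FALSE
)

# Character column: levels are inferred separately for each split
train_chr <- df[1:4, ]
test_chr  <- df[5:6, ]
x_train_chr <- model.matrix(y ~ ., train_chr)[, -1, drop = FALSE]
x_test_chr  <- model.matrix(y ~ ., test_chr)[, -1, drop = FALSE]
ncol(x_train_chr)  # dummy columns for "green" and "red"
ncol(x_test_chr)   # only a dummy column for "red"

# Factor column created before the split: both halves keep all levels
df$colour <- as.factor(df$colour)
train_fac <- df[1:4, ]
test_fac  <- df[5:6, ]
x_train_fac <- model.matrix(y ~ ., train_fac)[, -1, drop = FALSE]
x_test_fac  <- model.matrix(y ~ ., test_fac)[, -1, drop = FALSE]
identical(colnames(x_train_fac), colnames(x_test_fac))
```

A level that is absent from a split simply shows up as an all-zero column, which is what keeps the matrix widths in agreement.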

Upvotes: 1

heeseon

Reputation: 33

ridge.mod_P <- coef(ridge.mod, s=cv.out$lambda.min)  # coefficients at lambda.min
ridge.mod_P
ridge.mod_P@x
coe <- matrix(ridge.mod_P@x)  # stored coefficient values as a column vector
coe2 <- coe[-1,]  # drop the first entry (the intercept)
newx16 <- newx[,-17]
newx16
newx16 %*% matrix(coe2)  # NA: this is the reason for the NA output
newx16 <- newx[,-c(1,17)]
coe2 <- coe[-(1:2),]  # 16 coefficients
newx16 %*% matrix(coe2)  # yHat: variables times coefficients

Upvotes: -2

Robert McDonald

Reputation: 1340

I'm posting an answer because this question still shows up in searches. The code below runs. I ran into several problems trying to replicate the example. There is missing data in bank; I deleted those observations. Also, the generated prediction is constant (0.4875) because the ridge regression shrinks the coefficients of all variables other than the constant term to (almost) zero, which is not surprising with a simulated value of rich.

library(caret) ## 6.0-81
library(glmnet) ## 2.0-16
url <- "http://www.stat.columbia.edu/~madigan/W2025/data/BankSortedMissing.TXT"
bank <- read.table(url, header=TRUE)
set.seed(1)
bank$rich <- sample(c(0:1), nrow(bank), replace=TRUE)
bank <- na.omit(bank)
trainbank <- bank[1:160, ]
testbank <- bank[161:200, ]
x <- model.matrix(rich~., trainbank)[,-1]
y <- trainbank$rich
cv.out <- cv.glmnet(x, y, alpha=0, family="binomial")
x.test <- model.matrix(rich ~ ., testbank)[,-1]
pred <- predict(cv.out, type='response', newx=x.test)

Upvotes: 0

Ruge

Reputation: 61

I had the same issue, and I think it is caused by the training and testing sets having different factor levels, which gives their sparse matrices different dimensions.

My solution is to create the sparse matrix X from the combined dataset:

library(Matrix)  # sparse.model.matrix
library(glmnet)  # cv.glmnet
traintest=rbind(training,testing)

X = sparse.model.matrix(as.formula(paste("y ~", paste(colnames(training[,-1]), sep = "", collapse=" +"))), data = traintest)
model = cv.glmnet(X[1:nrow(training),], training[,1], family = "binomial",type.measure = "auc",nfolds = 10)
plot(model)
model$lambda.min
#predict on test set
pred = predict(model, s='lambda.min', newx=X[-(1:nrow(training)),], type="response")

This just makes sure the test set has the same columns as the training set.

Upvotes: 6

ekardes

Reputation: 562

I've seen this error before as well. The problem in my data set was that factor variables in my training and test sets had different numbers of levels. Make sure that is not the case.
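
One way to check for this and align the test set, sketched in base R with made-up data (`train`, `test` and the column `g` are hypothetical). It assumes every value in the test set also occurs in the training set; any value unseen in training would become NA after re-levelling.

```r
# Hypothetical toy data: the test set is missing level "c"
train <- data.frame(g = factor(c("a", "b", "c", "a")))
test  <- data.frame(g = factor(c("a", "b")))

# Check whether the level sets agree
levels_match_before <- identical(levels(train$g), levels(test$g))

# Re-level the test factor using the training levels
test$g <- factor(test$g, levels = levels(train$g))
levels_match_after <- identical(levels(train$g), levels(test$g))

# The design matrices now have the same columns
x_train <- model.matrix(~ g, train)[, -1, drop = FALSE]
x_test  <- model.matrix(~ g, test)[, -1, drop = FALSE]
colnames(x_train)
colnames(x_test)
```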

Upvotes: 0

Mehrad

Reputation: 3829

I had the same issue and was getting exactly the same error. In the end none of the above worked for me, but I solved the issue! As the error clearly states, there is a "wrong dimensions" problem.

About my data

In my case I trained my glmnet fit on data with dimensions 36 x 895, and my test data was 6 x 6. The reason I had only 6 columns in my test dataset was that the lasso selected these 6 features when s="lambda.min".

My solution

I used a sparse matrix from the Matrix package to create a zero-filled matrix with the same columns as the training data (you can even use a normal matrix):

sparsed_test_data <- Matrix(data=0,
                            nrow=nrow(test_data),
                            ncol=ncol(training_data),
                            dimnames=list(rownames(test_data),
                                          colnames(training_data)),
                            sparse = TRUE)

Then I substituted the values I had into the correct columns:

for(i in colnames(test_data)){
    sparsed_test_data[, i] <- test_data[, i]
}

Now the predict function works fine.
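
For the "normal matrix" variant, a minimal base R sketch with made-up names and values (`training_cols`, `test_data` and the numbers are hypothetical). Filling with zeros is an assumption that only makes sense when the missing columns have zero coefficients in the fitted model, as with features the lasso dropped.

```r
# Hypothetical column names of the training matrix
training_cols <- c("x1", "x2", "x3", "x4")

# Hypothetical test data holding only the selected features x2 and x4
test_data <- matrix(c(1, 2, 3, 4), nrow = 2,
                    dimnames = list(c("r1", "r2"), c("x2", "x4")))

# Zero-filled dense matrix with the training columns
padded <- matrix(0, nrow = nrow(test_data), ncol = length(training_cols),
                 dimnames = list(rownames(test_data), training_cols))

# Copy the available columns into place by name
padded[, colnames(test_data)] <- test_data
padded
```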

Upvotes: 0

jimu

Reputation: 29

Looks like you just have the wrong thing being assigned to newx. Instead of:

bank$rich <- NULL
newx = data.matrix(test$rich)

you want to null out the values in test$rich and then feed test to data.matrix. So something like:

test$rich <- NULL
newx = data.matrix(test)
ridge.pred=predict(ridge.mod,newx=newx)

worked for me.

Also, it looks like your original data frame has some patterns based on the row: rows after 200 have NA values in newAccount. You might want to address missing values and your train/test split before running your regression.
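
To see the dimension difference without fitting anything, here is a small base R sketch with a made-up data frame (the columns `age` and `income` and all values are hypothetical; only `rich` mirrors the question). Passing a single column to data.matrix gives a one-column matrix, which cannot match a model trained on several predictors.

```r
# Hypothetical test set with two predictors and the response
test <- data.frame(age = c(30, 41, 25),
                   income = c(10, 20, 15),
                   rich = c(0, 1, 0))

# What the question does: only the response column ends up in newx
wrong_newx <- data.matrix(test$rich)

# The fix: drop the response, then convert the remaining predictors
test$rich <- NULL
right_newx <- data.matrix(test)

dim(wrong_newx)  # one column
dim(right_newx)  # one column per predictor
```

Comparing ncol(newx) with the number of columns in the training matrix before calling predict is a cheap way to catch this early.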

Upvotes: 2
