I am trying to predict probabilities in a dataset using glmnet. My code reads:
bank <- read.table("http://www.stat.columbia.edu/~madigan/W2025/data/BankSortedMissing.TXT",header=TRUE)
bank$rich<-sample(c(0:1), 233, replace=TRUE)
train=bank[1:200,]
test=bank[201:233,]
x=model.matrix(rich~., bank)[,-1]
cv.out=cv.glmnet(x, train$rich, alpha=0, family="binomial")
ridge.mod=glmnet(x, train$rich, alpha=0, family="binomial")
bank$rich <- NULL
newx = data.matrix(test$rich)
ridge.pred=predict(ridge.mod,newx=newx)
I get the same error with a second dataset:
train = data[1:2500,]
test = data[2501:5088,]
x=model.matrix(Y~x1+x2+x3+x4+x5+x6, data)[,-1]
cv.out=cv.glmnet(x, data$Y, alpha=0, family="binomial")
bestlam=cv.out$lambda.min
ridge.mod=glmnet(x, data$Y, alpha=0, family="binomial")
test$Y <- NULL
newx = data.matrix(test)
ridge.pred = predict(ridge.mod,newx=newx, type="response")
I keep getting this error message when using predict:
Error in as.matrix(cbind2(1, newx) %*% nbeta) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error in t(.Call(Csparse_dense_crossprod, y, t(x))) : error in evaluating the argument 'x' in selecting a method for function 't': Error: Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90
I've tried this on the "Hitters" dataset and it works perfectly fine.
library(ISLR)
library(glmnet)
Hitters=na.omit(Hitters)
Hitters$Rich<-ifelse(Hitters$Salary>500,1,0)
Hitters.train = Hitters[1:200,]
Hitters.test = Hitters[201:dim(Hitters)[1],]
x=model.matrix(Rich~.,Hitters)[,-1]
cv.out=cv.glmnet(x, Hitters$Rich, alpha=0, family="binomial")
bestlam=cv.out$lambda.min
ridge.mod=glmnet(x, Hitters$Rich, alpha=0,lambda=bestlam, family="binomial")
Hitters.test$Rich <- NULL
newx = data.matrix(Hitters.test)
ridge.pred=predict(ridge.mod,newx=newx, type="response")
head(ridge.pred)
ridge.pred[1:10,]
Does anyone know how I can fix this?
Upvotes: 9
Views: 24238
Reputation: 41
I got the same error because the training and testing datasets ended up with different dimensions due to differing factor levels. The problem was that the columns holding factor/categorical data were stored as character columns. I converted those columns from character to factor before splitting into training and testing sets, and it worked:
data$factor_column_a <- as.factor(data$factor_column_a)
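If several columns are affected, they can all be converted at once before the split; a minimal sketch, assuming a data frame named data whose categorical columns are stored as character:
# Convert every character column to a factor before splitting into train/test
char_cols <- sapply(data, is.character)
data[char_cols] <- lapply(data[char_cols], as.factor)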
Upvotes: 1
Reputation: 33
You can also compute the fitted values by hand from the coefficients at lambda.min; a mismatch between the coefficient vector and the columns of newx is what produces the NA output:
ridge.mod_P <- coef(ridge.mod, s = cv.out$lambda.min)  # coefficients at lambda.min
ridge.mod_P
coe <- matrix(ridge.mod_P@x)    # extract the coefficient values
coe2 <- coe[-1, ]               # drop the intercept
newx16 <- newx[, -17]
newx16 %*% matrix(coe2)         # NA: this is the reason for the NA output
newx16 <- newx[, -c(1, 17)]     # drop the columns with no matching coefficient
coe2 <- coe[-(1:2), ]           # 16 coefficients left
newx16 %*% matrix(coe2)         # yHat: coefficients times variables
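For comparison, the same quantities can come straight from predict(); a minimal sketch, assuming ridge.mod, cv.out, and a newx whose columns match the training matrix (note that predict() includes the intercept, which the manual product above drops):
# Linear predictor at lambda.min (link scale)
eta <- predict(ridge.mod, newx = newx, s = cv.out$lambda.min, type = "link")
# Fitted probabilities (response scale)
p <- predict(ridge.mod, newx = newx, s = cv.out$lambda.min, type = "response")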
Upvotes: -2
Reputation: 1340
I'm posting an answer because this question still shows up in searches. The code below runs. I ran into several problems trying to replicate the example. There is missing data in bank; I deleted those observations. Also, the generated prediction is constant (0.4875) because the ridge regression sets all variables other than the constant term to (almost) zero (not surprising with a simulated value of rich).
library(caret) ## 6.0-81
library(glmnet) ## 2.0-16
url <- "http://www.stat.columbia.edu/~madigan/W2025/data/BankSortedMissing.TXT"
bank <- read.table(url, header=TRUE)
set.seed(1)
bank$rich <- sample(c(0:1), nrow(bank), replace=TRUE)
bank <- na.omit(bank)
trainbank <- bank[1:160, ]
testbank <- bank[161:200, ]
x <- model.matrix(rich~., trainbank)[,-1]
y <- trainbank$rich
cv.out <- cv.glmnet(x, y, alpha=0, family="binomial")
x.test <- model.matrix(rich ~ ., testbank)[,-1]
pred <- predict(cv.out, type='response', newx=x.test)
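One thing to keep in mind: predict() on a cv.glmnet object uses s = "lambda.1se" by default; to predict at the lambda that minimizes the cross-validated error, pass it explicitly. A minimal sketch with the same objects as above:
# Predict at lambda.min instead of the default lambda.1se
pred.min <- predict(cv.out, newx = x.test, s = "lambda.min", type = "response")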
Upvotes: 0
Reputation: 61
I had the same issue, and I think it is caused by the training and testing sets having different factor levels and therefore sparse matrices of different dimensions.
My solution is to create the sparse matrix X from the combined dataset:
library(Matrix)   # for sparse.model.matrix
library(glmnet)
traintest = rbind(training, testing)
form = as.formula(paste("y ~", paste(colnames(training[,-1]), collapse = " + ")))
X = sparse.model.matrix(form, data = traintest)  # shared columns for train and test
model = cv.glmnet(X[1:nrow(training),], training[,1], family = "binomial", type.measure = "auc", nfolds = 10)
plot(model)
model$lambda.min
#predict on test set
pred = predict(model, s='lambda.min', newx=X[-(1:nrow(training)),], type="response")
This just makes sure the test set matrix has the same dimensions and columns as the training matrix.
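A toy illustration of the mismatch this avoids (hypothetical data, for illustration only): a factor level that appears only in the training split gives the training matrix an extra dummy column, while building both from the combined data keeps them aligned:
# Hypothetical example of a factor-level mismatch between splits
training <- data.frame(y = c(0, 1, 1), f = factor(c("a", "b", "c")))
testing  <- data.frame(y = c(0, 1),    f = factor(c("a", "b")))
ncol(model.matrix(y ~ f, training))   # 3 columns: intercept, fb, fc
ncol(model.matrix(y ~ f, testing))    # 2 columns: intercept, fb -- the mismatch
traintest <- rbind(training, testing)
ncol(model.matrix(y ~ f, traintest))  # 3 columns for every row -- aligned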
Upvotes: 6
Reputation: 562
I've seen this error before as well. The problem in my data set was that factor variables in my training and test sets had different numbers of levels. Make sure that is not the case.
Upvotes: 0
Reputation: 3829
I had the same issue and was getting the same exact error; in the end none of the above worked for me, but I solved it. As the error clearly states, there is a "wrong dimensions" problem.
In my case I trained my glmnet fit on data with dimensions 36 x 895, and my test data was 6 x 6. The reason I had only 6 columns in my test dataset was that the lasso selected those 6 features at s="lambda.min".
I used a sparse matrix from the Matrix package to create an empty matrix with the test rows but the training columns (a regular matrix works too):
library(Matrix)
# Empty matrix with the test set's rows but the training set's columns
sparsed_test_data <- Matrix(data = 0,
                            nrow = nrow(test_data),
                            ncol = ncol(training_data),
                            dimnames = list(rownames(test_data),
                                            colnames(training_data)),
                            sparse = TRUE)
and then substituted the values I had into the correct columns:
for(i in colnames(test_data)){
sparsed_test_data[, i] <- test_data[, i]
}
Now the predict function works fine.
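The loop can also be replaced with a single vectorized assignment, since Matrix objects accept indexing by column name; a minimal sketch under the same (assumed numeric) objects:
# Equivalent to the loop above: copy all shared columns at once
sparsed_test_data[, colnames(test_data)] <- as.matrix(test_data)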
Upvotes: 0
Reputation: 29
Looks like you just have the wrong thing being assigned to newx. Instead of:
bank$rich <- NULL
newx = data.matrix(test$rich)
you want to null out the values in test$rich and then feed test to data.matrix. So something like:
test$rich <- NULL
newx = data.matrix(test)
ridge.pred=predict(ridge.mod,newx=newx)
This worked for me.
Also, it looks like your original data frame has some patterns based on the row: rows after 200 have NA values in newAccount. You might want to address missing values and your train/test split before running your regression.
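A minimal sketch of that cleanup, under the assumption that you simply drop the incomplete rows before splitting (the 160/40 split follows another answer here):
# Remove rows with missing values, then split into train and test
bank <- na.omit(bank)
train <- bank[1:160, ]
test  <- bank[161:nrow(bank), ]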
Upvotes: 2