gojomoso

Reputation: 163

Caret Predict Target Variable nrow() is Null

df:

library(caret)

a = c("aa", "bb", "cc", "aa", "aa", "aa", "bb", "cc", "bb", "bb") 
b = c("aa", "bb", "cc", "aa", "aa", "aa", "bb", "cc", "bb", "bb") 
c = c("aa", "bb", "cc", "aa", "aa", "aa", "bb", "cc", "bb", "bb") 
d = c("aa", "bb", "cc", "aa", "aa", "aa", "bb", "cc", "bb", "bb") 
e = c(1, 0, 1, 0, 0, 0, 1, 1, 1, 1)

#df1
df1 = data.frame(a,b,c,d,e)
#df2
df2 = data.frame(a,b,c,d,e)

Caret log-reg model:

df1$e <- as.factor(df1$e)
df2$e <- as.factor(df2$e)

# define training control
train_control <- trainControl(method = "cv", number = 5)

# train the model on training set
model <- train(e ~ .,
               data = df1,
               trControl = train_control,
               method = "glm",
               family=binomial())

# logistic <- glm(WonLost ~ . -PANum, data=train, family="binomial")
df2$predict <- caret::predict.train(model, newdata=df2,type = "prob")


nrow(df2$predict)
nrow(df2$e)

Why is nrow(df2$e) NULL? I changed the target variable to a factor because of the warning below, but that seems to have caused my current issue.

Warning messages:
1: In train.default(x, y, weights = w, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.

Upvotes: 0

Views: 217

Answers (1)

Duck

Reputation: 39585

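caret can be sensitive to how the outcome is coded: even with a factor outcome, a glm logit model fitted through train() can run into trouble over whether the task is regression or classification. One suggestion that I learned is to recode the target variable as Yes/No. Note that the factor conversion is not what makes nrow(df2$e) come back NULL. With type = "prob", the caret predictions are a data frame, and it is added to df2 as a data frame inside a single column, which is why nrow(df2$predict) works. df2$e is just a vector with no dimensions, so nrow() returns NULL for it and you have to use length() or NROW() instead. Here is the code: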

library(caret)
#Vectors
a = c("aa", "bb", "cc", "aa", "aa", "aa", "bb", "cc", "bb", "bb") 
b = c("aa", "bb", "cc", "aa", "aa", "aa", "bb", "cc", "bb", "bb") 
c = c("aa", "bb", "cc", "aa", "aa", "aa", "bb", "cc", "bb", "bb") 
d = c("aa", "bb", "cc", "aa", "aa", "aa", "bb", "cc", "bb", "bb") 
e = c(1, 0, 1, 0, 0, 0, 1, 1, 1, 1)

#df1
df1 = data.frame(a,b,c,d,e)
#df2
df2 = data.frame(a,b,c,d,e)
#Format: recode the 0/1 target as a two-level No/Yes factor
df1$e <- factor(df1$e, levels = c(0, 1), labels = c("No", "Yes"))
df2$e <- factor(df2$e, levels = c(0, 1), labels = c("No", "Yes"))

# define training control
train_control <- trainControl(method = "cv", number = 5)

# train the model on training set
model <- train(e ~ .,
               data = df1,
               trControl = train_control,
               method = "glm",
               family=binomial())

#Predict
df2$predict <- caret::predict.train(model, newdata=df2,type = "prob")
#Checks
nrow(df2$predict)
NROW(df2$e)
length(df2$e)

Outputs:

df2
    a  b  c  d   e   predict.No predict.Yes
1  aa aa aa aa Yes 7.500000e-01        0.25
2  bb bb bb bb  No 2.500000e-01        0.75
3  cc cc cc cc Yes 8.646869e-09        1.00
4  aa aa aa aa  No 7.500000e-01        0.25
5  aa aa aa aa  No 7.500000e-01        0.25
6  aa aa aa aa  No 7.500000e-01        0.25
7  bb bb bb bb Yes 2.500000e-01        0.75
8  cc cc cc cc Yes 8.646869e-09        1.00
9  bb bb bb bb Yes 2.500000e-01        0.75
10 bb bb bb bb Yes 2.500000e-01        0.75

nrow(df2$predict)
[1] 10
NROW(df2$e)
[1] 10
length(df2$e)
[1] 10
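
The nrow() versus NROW() behaviour described above can also be checked without caret. This is a minimal base R sketch; the toy data frame and its probability values are made up for illustration and only stand in for df2 and the caret output.

```r
# Toy stand-in for df2 (hypothetical values, not caret output)
toy <- data.frame(e = factor(c("Yes", "No", "Yes")))

# Assigning a data frame to a single column nests it, as df2$predict does
toy$predict <- data.frame(No  = c(0.25, 0.75, 0.25),
                          Yes = c(0.75, 0.25, 0.75))

dim(toy$e)         # NULL: a factor is a vector with no dim attribute
nrow(toy$e)        # NULL, because nrow(x) is dim(x)[1L]
NROW(toy$e)        # 3: NROW() treats a vector as a one-column matrix
length(toy$e)      # 3
nrow(toy$predict)  # 3: the nested data frame has dimensions
```

So NROW() or length() is the safer check for a single column, while nrow() is only meaningful for objects that carry dimensions, such as data frames and matrices.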

Upvotes: 1
