ℕʘʘḆḽḘ

Reputation: 19375

tidy predictions and confusion matrix with glmnet

Consider this example:

library(quanteda)
library(caret)
library(glmnet)
library(dplyr)

dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "Chinese Macao",
                          "Tokyo Japan Chinese"),
                 doc_id = 1:4,
                 class = c("Y", "Y", "Y", "N"))

# now we make the dataframe bigger 
dtrain <- purrr::map_df(seq_len(100), function(x) dtrain)

Let's create a sparse document-term matrix and fit a glmnet model:

> dtrain <- dtrain %>% mutate(class = as.factor(class))
> mycorpus <- corpus(dtrain,  text_field = 'text')
> trainingdf <- dfm(mycorpus)
> trainingdf
Document-feature matrix of: 400 documents, 6 features (62.5% sparse).

And now we finally turn to the lasso model:

mymodel <- cv.glmnet(x = trainingdf, y = dtrain$class,
                     type.measure = 'class',
                     nfolds = 3,
                     alpha = 1,
                     parallel = FALSE,
                     family = 'binomial')

I have two simple questions.

How can I add the predictions to the original dtrain data? The raw output of

mypred <- predict(mymodel, newx = trainingdf,
                  s = 'lambda.min', type = 'class')

looks HORRIBLY NOT TIDY:

> mypred
    1  
1   "Y"
2   "Y"
3   "Y"

How can I use caret::confusionMatrix in this setting? Just using the following raises an error:

> confusion <- caret::confusionMatrix(data = mypred,
+                                     reference = dtrain$class)
Error: `data` and `reference` should be factors with the same levels.

Thanks!

Upvotes: 0

Views: 1522

Answers (1)

RLave

Reputation: 8364

In a classification model, the target variable needs to be a factor, and caret::confusionMatrix expects the predictions to be a factor with the same levels.

For example:

my_data is the dataset you train the model on, and my_target is the target (response) variable.

Note that as.factor(my_data$my_target) finds the levels for you from the values present in the data, so you don't need to specify them by hand.

See the difference when we print target:

target <- c("y", "n", "y", "n")
target
# [1] "y" "n" "y" "n" # this is a plain character vector
as.factor(target)
# [1] y n y n
# Levels: n y # this is the correct format, a factor with levels

This is important because even if your predictions (or test data) contain only one of the two classes in the target, a factor still records all the possible levels, including those that do not occur.
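A minimal base R sketch of that point. The vector pred_chr is made up for illustration and stands in for predictions that happen to contain a single class:

```r
# Hypothetical predictions in which only the class "Y" occurs
pred_chr <- c("Y", "Y", "Y")

# Without explicit levels, the factor only knows about "Y"
pred_auto <- factor(pred_chr)
levels(pred_auto)

# With explicit levels, "N" is kept as a level even though it never occurs
pred_fct <- factor(pred_chr, levels = c("N", "Y"))
levels(pred_fct)

# table() reports the unused level with a count of zero
counts <- table(pred_fct)
counts
```

The zero count for "N" is what lets a confusion matrix keep its full 2 x 2 shape.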

You can of course set them explicitly; taking them from the reference keeps the order identical in both factors:

my_pred <- factor(mypred, levels = levels(dtrain$class))
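As a rough sketch of how this feeds into a confusion matrix: the small matrix below is a hand-made stand-in for the one-column character matrix that predict returns with type = 'class' (as printed in the question), and reference stands in for dtrain$class. The caret call is guarded so the example still runs where caret is not installed.

```r
# Stand-in for dtrain$class: a factor with levels "N" and "Y"
reference <- factor(c("Y", "Y", "Y", "N"))

# Stand-in for the prediction output: a one-column character matrix
mypred <- matrix(c("Y", "Y", "Y", "Y"), ncol = 1,
                 dimnames = list(1:4, "1"))

# Drop the matrix structure and reuse the reference levels
pred <- factor(as.vector(mypred), levels = levels(reference))

# Base R cross-tabulation of predicted against actual classes
conf_tab <- table(predicted = pred, actual = reference)
conf_tab

# Same comparison with caret, if it is available
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = pred, reference = reference,
                               positive = "Y"))
}
```

Both arguments are now factors with identical levels, which is the condition named in the error message.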

To add them to the data, you can use

my_data$newpred <- my_pred

or

library(dplyr)
my_data <- my_data %>% mutate(newpred = my_pred)
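Putting it together in base R, a small sketch under the assumption that the predictions come back in the same row order as the training data. The data frame and the prediction matrix are toy stand-ins, and the column names pred and correct are only illustrative:

```r
# Toy stand-in for the training data
dtrain <- data.frame(doc_id = 1:4,
                     class  = factor(c("Y", "Y", "Y", "N")))

# Toy stand-in for the one-column character matrix of predictions
mypred <- matrix(c("Y", "Y", "Y", "N"), ncol = 1,
                 dimnames = list(1:4, "1"))

# One tidy factor column, sharing its levels with the true class
dtrain$pred <- factor(as.vector(mypred), levels = levels(dtrain$class))

# Row-wise flag: does the prediction match the true class?
dtrain$correct <- dtrain$pred == dtrain$class

str(dtrain)
```

Using as.vector first means the new column is an ordinary vector rather than a matrix column, which is what makes the result print and join like any other column.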

Upvotes: 2
