Goldman Clarck

Reputation: 77

Simple Decision Tree in R - Strange Results From Caret Package

I'm trying to fit a simple decision tree to the following data set using the caret package. The data is:

> library(caret)
> mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
> mydata$rank <- factor(mydata$rank)
  # create dummy variables
> X = predict(dummyVars(~ ., data=mydata), mydata)
> head(X)

A matrix: 6 × 7 of type dbl
admit  gre   gpa  rank.1  rank.2  rank.3  rank.4
    0  380  3.61       0       0       1       0
    1  660  3.67       0       0       1       0
    1  800  4.00       1       0       0       0
    1  640  3.19       0       0       0       1
    0  520  2.93       0       0       0       1
    1  760  3.00       0       1       0       0

Splitting into a training and testing set:

> trainset <- data.frame(X[1:300,])
> testset <- data.frame(X[301:400,])

Now applying the decision tree:

> tree <- train(factor(admit) ~., data = trainset, method = "rpart")
> tree

CART 

300 samples
  6 predictor
  2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 300, 300, 300, 300, 300, 300, ... 
Resampling results across tuning parameters:

 cp          Accuracy   Kappa    
0.01956522  0.6856163  0.1865179
0.03260870  0.6888378  0.1684015
0.08695652  0.7080434  0.1079462

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.08695652.

I get NaN in variable importance! Why?

> varImp(tree)$importance

A data.frame: 6 × 1
        Overall
          <dbl>
gre         NaN
gpa         NaN
rank.1      NaN
rank.2      NaN
rank.3      NaN
rank.4      NaN

In prediction, the decision tree also outputs only one class, the 0 class. Why? What's wrong with my code? Thanks in advance.

> y_pred <- predict(tree ,newdata=testset)
> y_test <- factor(testset$admit)
> confusionMatrix(y_pred, factor(y_test))

Confusion Matrix and Statistics

      Reference
Prediction  0  1
         0 65 35
         1  0  0

           Accuracy : 0.65            
             95% CI : (0.5482, 0.7427)
No Information Rate : 0.65            
P-Value [Acc > NIR] : 0.5458          

              Kappa : 0               

Mcnemar's Test P-Value : 9.081e-09       

        Sensitivity : 1.00            
        Specificity : 0.00            
     Pos Pred Value : 0.65            
     Neg Pred Value :  NaN            
         Prevalence : 0.65            
     Detection Rate : 0.65            
 Detection Prevalence : 1.00            
  Balanced Accuracy : 0.50            

   'Positive' Class : 0           

Upvotes: 1

Views: 1600

Answers (1)

Martin Gal

Reputation: 16998

I can't answer your question directly, but I can show you how I usually fit decision trees:

library(data.table)
library(tidyverse)
library(caret)
library(rpart)
library(rpart.plot)

# Reading data into data.table
mydata <- fread("https://stats.idre.ucla.edu/stat/data/binary.csv")

# converting rank and admit to factors
mydata$rank  <- as.factor(mydata$rank)
mydata$admit <- as.factor(mydata$admit)

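# creating train and test data (createDataPartition samples within each class of admit);
# the seed value is arbitrary and only makes the split reproducible
set.seed(123)
t_index  <- createDataPartition(mydata$admit, p=0.75, list=FALSE)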
trainset <- mydata[t_index,]
testset  <- mydata[-t_index,]

# calculating the model using rpart
model <- rpart(admit ~ .,
               data = trainset,
               parms = list(split="information"),
               method = "class")

# plotting the decision tree
model %>%
  rpart.plot(digits = 4)

# get confusion matrix
model %>%
  predict(testset, type = "class") %>%
  table(testset$admit) %>%
  confusionMatrix()

Perhaps this helps you a bit.
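
One possible explanation for both symptoms in the question, offered as an assumption rather than a confirmed diagnosis: if the selected cp (0.08695652) is larger than the improvement of the best first split, the final tree has no splits at all. A root-only tree predicts the majority class for every row and carries no variable importance, and rescaling all-zero importances to the 0-100 range would likely give NaN. The sketch below reproduces that situation with rpart alone on simulated data. The `toy` data frame, its coefficients, and cp = 0.9 are made up for illustration; they are not the UCLA data or the cp from the question.

```r
library(rpart)

# Simulated stand-in data (NOT the UCLA admissions file): a weak signal
# and roughly one third positives. All numbers here are assumptions.
set.seed(1)
n <- 300
toy <- data.frame(gre = rnorm(n, 580, 115),
                  gpa = rnorm(n, 3.4, 0.4))
toy$admit <- factor(rbinom(n, 1, plogis(-0.8 + 0.5 * (toy$gpa - 3.4))))

# A deliberately large cp: no split improves the fit enough to be kept.
stump <- rpart(admit ~ ., data = toy, method = "class",
               control = rpart.control(cp = 0.9))

# Each row of $frame is a node; rows marked "<leaf>" are terminal nodes.
# Zero non-leaf rows means the tree is just the root.
n_splits <- sum(stump$frame$var != "<leaf>")
print(n_splits)

# With no splits, every prediction is the same (majority) class.
pred <- predict(stump, toy, type = "class")
print(table(pred))

# And there is no variable importance to report.
print(length(stump$variable.importance))
```

For a caret fit, the underlying rpart object is stored in `tree$finalModel`, so the same `$frame` check can be applied there. If it shows a single `<leaf>` row, passing smaller cp candidates through the `tuneGrid` argument of `train()` (for example `tuneGrid = data.frame(cp = c(0.001, 0.005, 0.01))`, values chosen arbitrarily) is one way to get a tree that actually splits.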

Upvotes: 1
