I'm trying to fit a simple decision tree to the following data set using the caret package. The data:
> library(caret)
> mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
> mydata$rank <- factor(mydata$rank)
# create dummy variables
> X = predict(dummyVars(~ ., data=mydata), mydata)
> head(X)
A matrix: 6 × 7 of type dbl
admit gre gpa rank.1 rank.2 rank.3 rank.4
0 380 3.61 0 0 1 0
1 660 3.67 0 0 1 0
1 800 4.00 1 0 0 0
1 640 3.19 0 0 0 1
0 520 2.93 0 0 0 1
1 760 3.00 0 1 0 0
Splitting into a training and testing set:
> trainset <- data.frame(X[1:300,])
> testset <- data.frame(X[301:400,])
Now applying the decision tree:
> tree <- train(factor(admit) ~., data = trainset, method = "rpart")
> tree
CART
300 samples
6 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 300, 300, 300, 300, 300, 300, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.01956522 0.6856163 0.1865179
0.03260870 0.6888378 0.1684015
0.08695652 0.7080434 0.1079462
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.08695652.
I get NaN in the variable importance. Why?
> varImp(tree)$importance
       Overall
gre NaN
gpa NaN
rank.1 NaN
rank.2 NaN
rank.3 NaN
rank.4 NaN
Also, in prediction the decision tree only outputs one class (the 0 class). Why? What's wrong with my code? Thanks in advance.
> y_pred <- predict(tree ,newdata=testset)
> y_test <- factor(testset$admit)
> confusionMatrix(y_pred, factor(y_test))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 65 35
1 0 0
Accuracy : 0.65
95% CI : (0.5482, 0.7427)
No Information Rate : 0.65
P-Value [Acc > NIR] : 0.5458
Kappa : 0
Mcnemar's Test P-Value : 9.081e-09
Sensitivity : 1.00
Specificity : 0.00
Pos Pred Value : 0.65
Neg Pred Value : NaN
Prevalence : 0.65
Detection Rate : 0.65
Detection Prevalence : 1.00
Balanced Accuracy : 0.50
'Positive' Class : 0
I can't answer your question directly, but I can show you the approach I use to fit decision trees:
library(data.table)
library(tidyverse)
library(caret)
library(rpart)
library(rpart.plot)
# Reading data into data.table
mydata <- fread("https://stats.idre.ucla.edu/stat/data/binary.csv")
# converting rank and admit to factors
mydata$rank <- as.factor(mydata$rank)
mydata$admit <- as.factor(mydata$admit)
# creating train and test data (seeding the RNG for a reproducible split)
set.seed(42)
t_index <- createDataPartition(mydata$admit, p=0.75, list=FALSE)
trainset <- mydata[t_index,]
testset <- mydata[-t_index,]
# calculating the model using rpart
model <- rpart(admit ~ .,
               data = trainset,
               parms = list(split = "information"),
               method = "class")
# plotting the decision tree
model %>%
  rpart.plot(digits = 4)
# get confusion matrix
model %>%
  predict(testset, type = "class") %>%
  table(testset$admit) %>%
  confusionMatrix()
Perhaps this helps you a bit.
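One more hedged observation on the original symptoms: all-NaN importances together with single-class predictions are consistent with the selected cp (0.08695652) having pruned away every split, leaving a root-only tree that can only predict the majority class. A sketch of how to check this, and how to let caret evaluate smaller cp values (the `tuneGrid` below is my own assumption, not from the original post):

```r
library(caret)

mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
X <- data.frame(predict(dummyVars(~ ., data = mydata), mydata))
trainset <- X[1:300, ]

tree <- train(factor(admit) ~ ., data = trainset, method = "rpart")

# If this prints only a single "root" node, the tree has no splits:
# it always predicts the majority class, and rpart has no variable
# importance to report (hence the NaN values).
print(tree$finalModel)

# Forcing caret to try smaller complexity-parameter values
# (hypothetical grid) may recover a tree with actual splits:
tree2 <- train(factor(admit) ~ ., data = trainset, method = "rpart",
               tuneGrid = data.frame(cp = seq(0.001, 0.05, by = 0.005)))
varImp(tree2)
```

If `tree2$finalModel` now shows real splits, its predictions should contain both classes and `varImp` should return finite values.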