user2165379

Reputation: 479

Why is rpart more accurate than Caret rpart in R

This post mentions that Caret rpart is more accurate than rpart due to bootstrapping and cross validation:

Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?

However, when I compare both methods, I get an accuracy of 0.4879 for Caret rpart and 0.7347 for rpart (my code is copied below).

Besides that, the classification tree from Caret rpart has only a few nodes (splits) compared to the one from rpart.

Does anyone understand these differences?

Thank you!

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Loading libraries and the data

This is an R Markdown document. First we load the libraries and the data, then split the training data into a training set and a test set.

```{r section1, echo=TRUE}

# load libraries
library(knitr)
library(caret)
suppressMessages(library(rattle))
library(rpart.plot)

# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest  <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# download the datasets
training <- read.csv(url(wwwTrain))
testing  <- read.csv(url(wwwTest))

# set seed for reproducibility (must come before the random partition)
set.seed(12345)

# create a partition with the training dataset
inTrain  <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet  <- training[-inTrain, ]
dim(TrainSet)

```
## Cleaning the data

```{r section2, echo=TRUE}
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet  <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)

# remove variables that are mostly NA
AllNA    <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet  <- TestSet[, AllNA==FALSE]
dim(TrainSet)
dim(TestSet)

# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet  <- TestSet[, -(1:5)]
dim(TrainSet)


```

## Prediction modelling

First we build a classification model using Caret with the rpart method:
```{r section4, echo=TRUE}

mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)

mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)

```

Second we build a similar model using rpart:
```{r section7, echo=TRUE}

# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)

# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree

```

Upvotes: 1

Views: 3393

Answers (1)

missuse

Reputation: 19756

A simple explanation is that you did not tune either model, and at the default settings rpart happened to perform better by pure chance.

When you use the same parameters, you should expect the same performance.
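To illustrate how strongly the complexity parameter `cp` alone shapes the tree, here is a minimal sketch using only `rpart`. The built-in `iris` data is a stand-in assumption (it is not the dataset from the question), and `n_leaves` is a hypothetical helper defined just for this example.

```r
library(rpart)

# Hypothetical helper: count the terminal nodes (leaves) of a fitted rpart tree
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Same data and formula; only cp differs (iris is a stand-in dataset)
fit_default <- rpart(Species ~ ., data = iris, method = "class")  # default cp
fit_high_cp <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0.6))

rpart.control()$cp     # the default complexity parameter used by rpart
n_leaves(fit_default)  # leaves with the default cp
n_leaves(fit_high_cp)  # leaves with a much larger cp
```

A larger `cp` prunes away splits that do not improve the fit enough, so a model fitted with a large `cp` can end up with far fewer nodes than one fitted with the default.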

Let's do some tuning with caret:

set.seed(1)
mod_rpart <- train(classe ~ .,
                   method = "rpart",
                   data = TrainSet,
                   tuneLength = 50, 
                   metric = "Accuracy",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 4,
                                            repeats = 5,
                                            summaryFunction = multiClassSummary,
                                            classProbs = TRUE))

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
#output
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 4359  243   92  135   38
         B  446 2489  299  161  276
         C  118  346 2477  300   92
         D  190  377  128 2240  368
         E  188  152  254  219 2652

Overall Statistics

               Accuracy : 0.7628          
                 95% CI : (0.7566, 0.7688)
    No Information Rate : 0.2844          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.7009          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8223   0.6900   0.7622   0.7332   0.7741
Specificity            0.9619   0.9214   0.9444   0.9318   0.9466
Pos Pred Value         0.8956   0.6780   0.7432   0.6782   0.7654
Neg Pred Value         0.9316   0.9253   0.9495   0.9469   0.9490
Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2339   0.1335   0.1329   0.1202   0.1423
Detection Prevalence   0.2611   0.1970   0.1788   0.1772   0.1859
Balanced Accuracy      0.8921   0.8057   0.8533   0.8325   0.8603

That is a bit better than rpart with default settings (cp = 0.01).
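As a sanity check, the reported accuracy can be recomputed by hand from the confusion matrix in the output above. This is only a sketch in base R: the counts are copied from that output, and the variable names (`cm`, `accuracy`, `nir`) are chosen here for illustration.

```r
# Confusion matrix copied from the output above (rows = Prediction, columns = Reference)
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(4359,  243,   92,  135,   38,
                446, 2489,  299,  161,  276,
                118,  346, 2477,  300,   92,
                190,  377,  128, 2240,  368,
                188,  152,  254,  219, 2652),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy <- sum(diag(cm)) / sum(cm)     # correct predictions / all predictions
nir      <- max(colSums(cm)) / sum(cm)  # share of the most frequent reference class

round(accuracy, 4)  # 0.7628, the Accuracy line in the output
round(nir, 4)       # 0.2844, the No Information Rate line
```

Accuracy is the sum of the diagonal divided by the total count, and the no-information rate is what you would get by always predicting the most common class (here A).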

How about setting cp to the optimal value chosen by caret:

# bestTune is a one-row data frame, so extract the cp column explicitly
modFitDecTree <- rpart(classe ~ .,
                       data = TrainSet,
                       method = "class",
                       control = rpart.control(cp = mod_rpart$bestTune$cp))

predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class")
confusionMatrix(predictDecTree, TestSet$classe)
# part of output
Accuracy : 0.7628   
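For readers who want to see the idea of tuning `cp` without caret, a rough analogue can be sketched with rpart's own built-in cross-validation. This is not what caret does internally (caret resamples through `trainControl`); it assumes the built-in `iris` data as a stand-in and simply picks the `cp` with the lowest cross-validated error from the tree's `cptable`.

```r
library(rpart)

set.seed(1)  # the cross-validation folds are random

# Grow a deliberately large tree: cp = 0 means no complexity penalty while growing,
# xval = 10 asks rpart for 10-fold cross-validation of each candidate cp
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, xval = 10))

# cptable has one row per candidate cp, with its cross-validated error ("xerror")
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the large tree back to the selected complexity
pruned_tree <- prune(full_tree, cp = best_cp)

printcp(full_tree)
best_cp
```

The selected `cp` plays the same role as `mod_rpart$bestTune` above: refitting or pruning with that value should give a tree of the chosen complexity.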

Upvotes: 4
