user2071938

Reputation: 2265

Decision Trees with R

I ran this example from the rpart man page:

library(rpart)
tree <- rpart(Species~., data = iris)
plot(tree,margin=0.1)
text(tree)

Now I want to adapt it to another dataset:

digitstrainURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits.tra"
digitsTestURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits.tes"
digitcols <- c(paste0("i", 1:16), "Class")  # 16 input features plus the digit label
digitstrain <- read.table(digitstrainURL, sep=",", col.names=digitcols)
digitstest <- read.table(digitsTestURL, sep=",", col.names=digitcols)

tree <- rpart(Class~., data = digitstrain)
plot(tree,margin=0.1)
text(tree)

The dataset contains data for handwritten digits, and the "Class" column holds the digit (0-9). But when I plot the tree, I get weird floating-point numbers at the leaves. Any idea what these numbers mean? I would prefer to have 0-9 as the leaf labels.

Upvotes: 0

Views: 889

Answers (2)

Andrie

Reputation: 179558

You are trying to fit a classification tree, but your response variable is stored as integers, not as a factor.

The function rpart tries to guess which method to use from the type of the response, and in your case it guesses wrong. Your code therefore fits a regression tree with method="anova", whereas you want method="class". The floating-point numbers at the leaves are the mean of Class within each leaf.

Try this:

tree <- rpart(Class~., data = digitstrain, method="class")
plot(tree,margin=0.1)
text(tree, cex=0.7)
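
To confirm which method rpart picked, you can inspect the method component of the fitted object. The following is a minimal sketch on made-up data; the data frame toy and its column x are hypothetical stand-ins for the digits data, not part of the original question.

```r
library(rpart)

# Hypothetical toy data: integer-coded labels 0-2 standing in for the digit classes
set.seed(1)
toy <- data.frame(Class = rep(0:2, each = 50))
toy$x <- rnorm(150) + toy$Class   # one predictor that carries some signal

fit_num <- rpart(Class ~ x, data = toy)          # response left as integers
fit_fac <- rpart(factor(Class) ~ x, data = toy)  # response converted to a factor

fit_num$method  # "anova": regression tree, leaves show mean responses
fit_fac$method  # "class": classification tree, leaves show class labels
```

Passing method="class" explicitly, as above, gives the same kind of tree as converting the response to a factor.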

To test the accuracy of your model, you can use predict to get predicted values and then create a confusion matrix:

confusion <- data.frame(
  class=factor(digitstest$Class), 
  predict=predict(tree, digitstest, type="class")
  )
with(confusion, table(class, predict))

     predict
class   0   1   2   3   4   5   6   7   8   9
    0 311   1   0   0   0   0   0   7  42   2
    1   0 139 186   4   0   0   0   1  10  24
    2   0   0 320  14   2   3   0   7  15   3
    3   0   6   0 309   1   3   0  17   0   0
    4   0   1   0   5 300   0   0   0   0  58
    5   0   0   0  74   0 177   0   1  14  69
    6   5   0   3   9  12   0 264  11   5  27
    7   2   9  11  13   0  10   0 290   0  29
    8  60   0   0   0   0  32   0  21 220   3
    9   1  44   0   9  20   0   0   8   0 254
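
A confusion matrix like this can be condensed into an overall accuracy and a per-class hit rate. Here is a small sketch of that arithmetic; the vectors actual and predicted are invented three-class labels for illustration only, not the digits results.

```r
# Hypothetical labels, for illustration only
actual    <- factor(c(0, 0, 0, 1, 1, 1, 2, 2, 2, 2))
predicted <- factor(c(0, 0, 1, 1, 1, 2, 2, 2, 2, 0), levels = levels(actual))

# Rows are the true classes, columns the predicted classes
cm <- table(class = actual, predict = predicted)

accuracy <- sum(diag(cm)) / sum(cm)   # share of all cases predicted correctly
recall   <- diag(cm) / rowSums(cm)    # share of each true class predicted correctly

accuracy  # 0.7
recall
```

The same two lines work on the result of with(confusion, table(class, predict)), provided both factors share the same levels so that the table is square.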

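Note that the prediction from a single tree isn't great: the diagonal of the table shows that only about 74% of the test digits (2584 of 3498) are classified correctly, and more than half of the 1s are mistaken for 2s. A very easy way to improve the prediction is to use a random forest, consisting of many trees, each fitted to a bootstrap sample of your training data with a random subset of the predictors considered at each split: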

library(randomForest)

# method= is not a randomForest argument; a factor response already selects classification
fst <- randomForest(factor(Class)~., data = digitstrain)

Observe that the forest gives far better predictions, with about 97% of the test digits (3378 of 3498) classified correctly:

confusion <- data.frame(
  class=factor(digitstest$Class), 
  predict=predict(fst, digitstest, type="class")
  )
with(confusion, table(class, predict))

     predict
class   0   1   2   3   4   5   6   7   8   9
    0 347   0   0   0   0   0   0   0  16   0
    1   0 333  28   1   1   0   0   1   0   0
    2   0   5 359   0   0   0   0   0   0   0
    3   0   4   0 331   0   0   0   0   0   1
    4   0   0   0   0 362   1   0   0   0   1
    5   0   0   0   8   0 316   0   0   0  11
    6   1   0   0   0   0   0 335   0   0   0
    7   0  26   2   0   0   0   0 328   0   8
    8   0   0   0   0   0   0   0   0 336   0
    9   0   2   0   0   0   0   0   2   1 331
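
A fitted forest also carries an out-of-bag (OOB) error estimate, which gives a rough accuracy check without a separate test set. The sketch below uses the built-in iris data rather than the digits, and the ntree value is an arbitrary choice for illustration.

```r
library(randomForest)

set.seed(42)
# iris is used only to keep the example self-contained
fit <- randomForest(Species ~ ., data = iris, ntree = 200)

# OOB confusion matrix: rows are true classes, last column is the per-class error
fit$confusion

# OOB error rate after the final tree
oob_error <- fit$err.rate[fit$ntree, "OOB"]
oob_error
```

The OOB figure is computed only from trees that did not see a given observation, so it will usually be close to, but not identical with, the error on a held-out test set.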

Upvotes: 1

vrajs5

Reputation: 4126

This happens because your Class column is numeric, so rpart fits a regression tree. Convert it to a factor and try again:

digitstrain$Class = as.factor(digitstrain$Class)
tree <- rpart(Class~., data = digitstrain)
plot(tree,margin=0.1)
text(tree)

The leaves of the resulting plot are then labelled with the predicted digit (0-9).

Upvotes: 0
