Reputation: 2265
I ran this example from the rpart man page:
library(rpart)
tree <- rpart(Species~., data = iris)
plot(tree,margin=0.1)
text(tree)
Now I want to adapt it to another dataset:
digitstrainURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits.tra"
digitsTestURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits.tes"
digitstrain <- read.table(digitstrainURL, sep=",",
col.names=c("i1","i2","i3","i4","i5","i6","i7","i8","i9","i10","i11","i12","i13","i14","i15","i16", "Class"))
digitstest <- read.table(digitsTestURL, sep=",",
col.names=c("i1","i2","i3","i4","i5","i6","i7","i8","i9","i10","i11","i12","i13","i14","i15","i16", "Class"))
tree <- rpart(Class~., data = digitstrain)
plot(tree,margin=0.1)
text(tree)
The dataset contains data for handwritten digits, and the "Class" column holds the digit (0-9). But when I plot the tree, I get weird floating-point numbers as the result. Any idea what these numbers mean? I would prefer to have 0-9 as the text for the leaves.
Upvotes: 0
Views: 889
Reputation: 179558
You are trying to fit a classification tree, but your response variable is stored as integers, not as a factor.
The function rpart will try to guess what method to use, and in your case it makes the wrong guess. So your code fits a regression tree with method="anova", whereas you want method="class". The floating-point numbers in the leaves are the mean Class value of the observations that fall into each leaf.
Try this:
tree <- rpart(Class~., data = digitstrain, method="class")
plot(tree,margin=0.1)
text(tree, cex=0.7)
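To see the guessing behaviour in isolation, here is a minimal sketch on the built-in iris data rather than the digits data. The names iris_int, fit_guess and fit_class are made up for illustration; the response is recoded as integers on purpose to mimic the Class column.

```r
library(rpart)

# Illustrative only: recode the factor response as integers, like the Class column
iris_int <- transform(iris, Species = as.integer(Species))

# No method given: rpart picks one from the type of the response
fit_guess <- rpart(Species ~ ., data = iris_int)

# Method given explicitly: the integer response is treated as class labels
fit_class <- rpart(Species ~ ., data = iris_int, method = "class")

fit_guess$method  # "anova"
fit_class$method  # "class"
```

The method component of the fitted object is a quick way to check which kind of tree you actually got before plotting it.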
To test the accuracy of your model, you can use predict to get predicted values for the test set and then create a confusion matrix:
confusion <- data.frame(
class=factor(digitstest$Class),
predict=predict(tree, digitstest, type="class")
)
with(confusion, table(class, predict))
     predict
class   0   1   2   3   4   5   6   7   8   9
    0 311   1   0   0   0   0   0   7  42   2
    1   0 139 186   4   0   0   0   1  10  24
    2   0   0 320  14   2   3   0   7  15   3
    3   0   6   0 309   1   3   0  17   0   0
    4   0   1   0   5 300   0   0   0   0  58
    5   0   0   0  74   0 177   0   1  14  69
    6   5   0   3   9  12   0 264  11   5  27
    7   2   9  11  13   0  10   0 290   0  29
    8  60   0   0   0   0  32   0  21 220   3
    9   1  44   0   9  20   0   0   8   0 254
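A confusion matrix like this can be boiled down to a single accuracy figure and a per-class hit rate. The following is only a sketch: it uses the built-in iris data so that it runs without a download, it scores the training data itself rather than a held-out test set, and the names fit, pred, tab, accuracy and per_class are chosen for illustration.

```r
library(rpart)

# Fit a classification tree and predict class labels (here on the training data)
fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, iris, type = "class")

# Rows are the true classes, columns the predicted classes
tab <- table(class = iris$Species, predict = pred)

# Overall accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(tab)) / sum(tab)

# Per-class hit rate: diagonal entry divided by the row total
per_class <- diag(tab) / rowSums(tab)

accuracy
per_class
```

The same two lines for accuracy and per_class should work on the table produced by with(confusion, table(class, predict)), provided both factors share the same levels so that the table is square.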
Note that the prediction using a single tree isn't great. A very easy way to improve the prediction is to use a random forest, consisting of many trees fitted with random subsets of your training data:
library(randomForest)
fst <- randomForest(factor(Class)~., data = digitstrain)
Observe that the forest gives far superior prediction results:
confusion <- data.frame(
class=factor(digitstest$Class),
predict=predict(fst, digitstest, type="class")
)
with(confusion, table(class, predict))
     predict
class   0   1   2   3   4   5   6   7   8   9
    0 347   0   0   0   0   0   0   0  16   0
    1   0 333  28   1   1   0   0   1   0   0
    2   0   5 359   0   0   0   0   0   0   0
    3   0   4   0 331   0   0   0   0   0   1
    4   0   0   0   0 362   1   0   0   0   1
    5   0   0   0   8   0 316   0   0   0  11
    6   1   0   0   0   0   0 335   0   0   0
    7   0  26   2   0   0   0   0 328   0   8
    8   0   0   0   0   0   0   0   0 336   0
    9   0   2   0   0   0   0   0   2   1 331
Upvotes: 1
Reputation: 4126
This happens because your Class column is numeric. Convert it to a factor and try again:
digitstrain$Class = as.factor(digitstrain$Class)
tree <- rpart(Class~., data = digitstrain)
plot(tree,margin=0.1)
text(tree)
The result would be a classification tree whose leaves are labelled with the digits 0-9.
Upvotes: 0