user3115933

Reputation: 4443

How do I generate a Decision Tree plot and a Variable Importance plot in Random Forest using R?

I am new to Data Science and I am working on a Machine Learning analysis using Random Forest algorithm to perform a classification. My target variable in my data set is called Attrition (Yes/No).

I am a bit confused as to how to generate these 2 plots in Random Forest:

(1) Feature Importance Plot

(2) Decision Tree Plot

I understand that Random Forest is an ensemble of several Decision Tree models built from the data set.

Assuming my Training data set is called TrainDf and my Testing data set is called TestDf, how can I create these 2 plots in R?

UPDATE: From these 2 posts, it seems that this cannot be done, or am I missing something here?

(1) Why is Random Forest with a single tree much better than a Decision Tree classifier?

(2) How would you interpret an ensemble tree model?

Upvotes: 2

Views: 3265

Answers (2)

Sandipan Dey

Reputation: 23101

Feature importance plot with ggplot2:

library(randomForest)
library(ggplot2)

# vs is numeric (0/1), so this fits a regression forest;
# its default importance measure is IncNodePurity
mtcars.rf <- randomForest(vs ~ ., data=mtcars)
imp <- cbind.data.frame(Feature=rownames(mtcars.rf$importance), mtcars.rf$importance)
# bar plot of importance, features sorted in decreasing order
ggplot(imp, aes(x=reorder(Feature, -IncNodePurity), y=IncNodePurity)) +
  geom_bar(stat = 'identity') + xlab('Feature')
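For a classification forest (as with the Attrition Yes/No target in the question), the response must be a factor, and the default importance column is MeanDecreaseGini rather than IncNodePurity. A minimal sketch of the same plot using the built-in iris data (dataset chosen here purely for illustration):

```r
library(randomForest)
library(ggplot2)

set.seed(42)
# A factor response makes randomForest fit a classification forest,
# whose default importance measure is MeanDecreaseGini
iris.rf <- randomForest(Species ~ ., data = iris)
imp <- cbind.data.frame(Feature = rownames(iris.rf$importance), iris.rf$importance)
ggplot(imp, aes(x = reorder(Feature, -MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = 'identity') + xlab('Feature')
```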

[Feature importance bar plot]

A Decision Tree plot with igraph (a single tree extracted from the random forest):

tree <- randomForest::getTree(mtcars.rf, k=1, labelVar=TRUE) # get the 1st decision tree with k=1
tree$`split var` <- as.character(tree$`split var`)
tree$`split point` <- as.character(tree$`split point`)
tree[is.na(tree$`split var`),]$`split var` <- ''     # leaf nodes have no split variable
tree[tree$`split point` == '0',]$`split point` <- ''

library(igraph)
# each node has edges to its left and right daughters; leaves point to node 0
gdf <- data.frame(from = rep(rownames(tree), 2),
                  to = c(tree$`left daughter`, tree$`right daughter`))
g <- graph_from_data_frame(gdf, directed=TRUE)
g <- delete_vertices(g, '0')  # drop the dummy node 0 before labelling, so label and vertex counts match
V(g)$label <- paste(tree$`split var`, '\r\n(', tree$`split point`, ',', round(tree$prediction,2), ')')
print(g, e=TRUE, v=TRUE)
plot(g, layout = layout_as_tree(g, root=1), vertex.size=5, vertex.color='cyan')

As can be seen from the following plot, the label for each node in the decision tree shows the variable chosen for the split at that node, followed by (the split value, the proportion of observations with class label 1) at that node.

[Plot of the 1st decision tree from the forest]

Likewise, the 100th tree can be obtained with k=100 in randomForest::getTree(); it looks like the following:
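The data frame that getTree() returns can also be inspected directly before any plotting; a quick sketch (refitting a small regression forest on mtcars, as in the answer above):

```r
library(randomForest)

set.seed(42)
mtcars.rf <- randomForest(vs ~ ., data = mtcars)
# Each row is a node: daughter node indices, split variable and point,
# node status, and the node's prediction
tree1 <- randomForest::getTree(mtcars.rf, k = 1, labelVar = TRUE)
head(tree1)
```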

[Plot of the 100th decision tree from the forest]

Upvotes: 2

RSK

Reputation: 753

To plot the variable importance, you can use the code below.

library(randomForest)
# importance=TRUE computes permutation importance in addition to the default measure
mtcars.rf <- randomForest(am ~ ., data=mtcars, ntree=1000, keep.forest=FALSE,
                          importance=TRUE)
varImpPlot(mtcars.rf)
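If the response is converted to a factor (matching the question's Yes/No Attrition variable), the forest becomes a classifier, and with importance=TRUE varImpPlot() shows the two classification measures, MeanDecreaseAccuracy and MeanDecreaseGini. A hedged sketch of that variant:

```r
library(randomForest)

set.seed(42)
# as.factor() turns this into a classification forest, like the Attrition case
rf <- randomForest(as.factor(am) ~ ., data = mtcars, ntree = 1000, importance = TRUE)
varImpPlot(rf)  # panels: MeanDecreaseAccuracy and MeanDecreaseGini
```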

Upvotes: 2
