Adam Ralphus

Reputation: 339

Decision Tree Issue: Why does tree() not pick all variables for the nodes


I am working on a project at my workplace and I am running into some issues with my decision tree analysis. This is not a homework assignment. Here is a sample of the dataset:

PRODUCT_SUB_LINE_DESCR    MAJOR_CATEGORY_DESCR    CUST_REGION_DESCR
SUNDRY                    SMALL EQUIP             NORTH EAST REGION
SUNDRY                    SMALL EQUIP             SOUTH EAST REGION
SUNDRY                    SMALL EQUIP             SOUTH EAST REGION
SUNDRY                    SMALL EQUIP             NORTH EAST REGION
SUNDRY                    PREVENTIVE              SOUTH CENTRAL REGION
SUNDRY                    PREVENTIVE              SOUTH EAST REGION
SUNDRY                    PREVENTIVE              SOUTH EAST REGION
SUNDRY                    SMALL EQUIP             NORTH CENTRAL REGION
SUNDRY                    SMALL EQUIP             MOUNTAIN WEST REGION
SUNDRY                    SMALL EQUIP             MOUNTAIN WEST REGION
SUNDRY                    COMPOSITE               NORTH CENTRAL REGION
SUNDRY                    COMPOSITE               NORTH CENTRAL REGION
SUNDRY                    COMPOSITE               OHIO VALLEY REGION
SUNDRY                    COMPOSITE               NORTH EAST REGION

Sales   QtySold      MFGCOST    MarginDollars   new_ProductName
209.97  3             134.55    72.72            no
-76.15  -1            -44.85    -30.4            no
275.6   2             162.5     109.84           no
138.7   1             81.25     55.82            no
226     2             136       87.28            no
115     1             68        45.64            no
210.7   2             136       71.98            no
29      1             18.85     9.77             no
29      1             18.85     9.77             no
46.32   2             37.7      7.86             no
159.86  1             132.4     24.81            no
441.3   2             264.8     171.2            no
209.62  1             132.4     74.57            no
209.62  1             132.4     74.57            no

1) My tree has only two terminal nodes. Here is the summary output:

> summary(tree_model)
Classification tree:
tree(formula = new_ProductName ~ ., data = training_data)
Variables actually used in tree construction:
[1] "PRODUCT_SUB_LINE_DESCR"
Number of terminal nodes:  2 
Residual mean deviance:  0 = 0 / 41140 
Misclassification error rate: 0 = 0 / 41146 

2) I created a new data frame that keeps only the factors with fewer than 22 levels. There is one factor with 25 levels, but tree() does not throw an error, so I think the algorithm accepts 25 levels:

> str(new_Dataset)
'data.frame':   51433 obs. of  7 variables:
 $ PRODUCT_SUB_LINE_DESCR: Factor w/ 3 levels "Handpieces","PRIVATE LABEL",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ MAJOR_CATEGORY_DESCR  : Factor w/ 25 levels "AIR ABRASION",..: 23 23 23 23 21 21 21 23 23 23 ...
 $ CUST_REGION_DESCR     : Factor w/ 7 levels "MOUNTAIN WEST REGION",..: 3 6 6 3 5 6 6 2 1 1 ...
 $ Sales                 : num  210 -76.2 275.6 138.7 226 ...
 $ QtySold               : int  3 -1 2 1 2 1 2 1 1 2 ...
 $ MFGCOST               : num  134.6 -44.9 162.5 81.2 136 ...
 $ MarginDollars         : num  72.7 -30.4 109.8 55.8 87.3 ...
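
To double-check how many levels each factor actually has, a quick check like this works (a minimal sketch over the new_Dataset data frame shown above):

 # number of levels for each factor column (NA for non-factor columns)
 sapply(new_Dataset, function(x) if (is.factor(x)) nlevels(x) else NA)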

3) Here is how I set up my analysis

 library(tree) # for tree(), cv.tree(), prune.misclass()

 # I chose product name as my main attribute (maybe that is why it appears
 # at the root node?)
 new_ProductName = ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL",
                          "yes", "no")
 data = data.frame(new_Dataset, new_ProductName)
 set.seed(100)
 train = sample(1:nrow(data), 0.8 * nrow(data)) # training row indices
 training_data = data[train, ]  # training data
 testing_data  = data[-train, ] # testing data

 # fit the tree model using the training data
 tree_model = tree(new_ProductName ~ ., data = training_data)
 summary(tree_model)
 plot(tree_model)
 text(tree_model, pretty = 0)
 out = predict(tree_model) # predict on the training data
 # actuals
 input.newproduct = as.character(training_data$new_ProductName)
 # predicted
 pred.newproduct = colnames(out)[max.col(out, ties.method = "first")]
 mean(input.newproduct != pred.newproduct) # misclassification rate
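
As an aside, predict() for a fitted tree object can also return the class labels directly, which makes the max.col() step unnecessary; a minimal equivalent sketch (pred.class is just an illustrative name):

 # same misclassification rate, computed with type = "class"
 pred.class = predict(tree_model, type = "class")
 mean(as.character(pred.class) != input.newproduct)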

# Cross-validation to see how much we need to prune the tree
set.seed(400)
cv_Tree = cv.tree(tree_model, FUN = prune.misclass) # run cross-validation
plot(cv_Tree) # plot the CV results
plot(cv_Tree$size, cv_Tree$dev, type = "b")
# set `best` to the size corresponding to the lowest deviance in the plot above
treePruneMod = prune.misclass(tree_model, best = 9)
plot(treePruneMod)
text(treePruneMod, pretty = 0)
out = predict(treePruneMod) # predict on the training data with the pruned tree
# predicted
pred.newproduct = colnames(out)[max.col(out, ties.method = "random")]
# calculate the misclassification error
mean(training_data$new_ProductName != pred.newproduct)
# predict the test data with the pruned tree
out = predict(treePruneMod, testing_data, type = "class")
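
To actually score the held-out data, these test-set predictions can then be compared against the true labels, e.g. with a confusion matrix; a minimal sketch using the objects defined above:

# confusion matrix and misclassification rate on the test set
table(predicted = out, actual = testing_data$new_ProductName)
mean(as.character(out) != as.character(testing_data$new_ProductName))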

4) I have never done this before. I watched a couple of YouTube videos and started from there. I welcome advice, explanations, and criticism; please help me through this process. This has been challenging for me.

> table(data$PRODUCT_SUB_LINE_DESCR, data$new_ProductName)

                     no    yes
  Handpieces        164      0
  PRIVATE LABEL       0  14802
  SUNDRY          36467      0

Upvotes: 1

Views: 5193

Answers (1)

Mankind_2000

Reputation: 2218

In short, trees work by finding the variable that gives the best* split (i.e. best differentiates between the classes) at every node, and a branch terminates once it is pure.

For your problem, the algorithm determines that "PRODUCT_SUB_LINE_DESCR" is the best variable to split on, and that split produces pure branches on both sides, so no further split is required.

This is due to how you defined your classes, and your intuition is somewhat right:

# I choose product name as my main attribute (maybe that is why it appears at
# the root node?)
new_ProductName = ifelse(PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL", "yes", "no")

With the above code/rule you defined your classes and the best split at the same time. At this point, tree-based classification is equivalent to a simple rule-based classification. Not a good idea.

You should first ponder what you want to achieve. If you want to predict the product name given the other attributes, then drop the "PRODUCT_SUB_LINE_DESCR" column from the data frame after creating your classes (i.e. "new_ProductName"), and then run the tree classification, as sketched below.
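
A minimal sketch of that fix, assuming the combined data frame is named data as in your code (tree_model2 is just an illustrative name):

# drop the column that deterministically defines the target class
data$PRODUCT_SUB_LINE_DESCR = NULL
# refit the tree on the remaining predictors
tree_model2 = tree(new_ProductName ~ ., data = data)
summary(tree_model2)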

*Note: the best split is chosen based on a criterion such as information gain or the Gini index.
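
For intuition: the Gini index of a node with class proportions p_k is 1 - sum(p_k^2), so a pure node scores 0, which is why the algorithm stops splitting there. A tiny illustrative helper (not part of the tree package):

# Gini impurity of a vector of class labels; 0 means the node is pure
gini = function(y) {
  p = table(y) / length(y)
  1 - sum(p^2)
}
gini(c("no", "no", "no", "yes"))  # 0.375
gini(c("yes", "yes", "yes"))      # 0, a pure node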

Upvotes: 1
