I am working on a project at my workplace and I am running into some issues with my decision tree analysis. This is not a homework assignment.

Sample dataset:
PRODUCT_SUB_LINE_DESCR  MAJOR_CATEGORY_DESCR  CUST_REGION_DESCR     Sales    QtySold  MFGCOST  MarginDollars  new_ProductName
SUNDRY                  SMALL EQUIP           NORTH EAST REGION     209.97    3       134.55    72.72         no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION     -76.15   -1       -44.85   -30.4          no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION     275.6     2       162.5    109.84         no
SUNDRY                  SMALL EQUIP           NORTH EAST REGION     138.7     1        81.25    55.82         no
SUNDRY                  PREVENTIVE            SOUTH CENTRAL REGION  226       2       136       87.28         no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION     115       1        68       45.64         no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION     210.7     2       136       71.98         no
SUNDRY                  SMALL EQUIP           NORTH CENTRAL REGION   29       1        18.85     9.77         no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION   29       1        18.85     9.77         no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION   46.32    2        37.7      7.86         no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION  159.86    1       132.4     24.81         no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION  441.3     2       264.8    171.2          no
SUNDRY                  COMPOSITE             OHIO VALLEY REGION    209.62    1       132.4     74.57         no
SUNDRY                  COMPOSITE             NORTH EAST REGION     209.62    1       132.4     74.57         no
1) My tree has only two terminal nodes; here is the summary output:
>summary(tree_model)
Classification tree:
tree(formula = new_ProductName ~ ., data = training_data)
Variables actually used in tree construction:
[1] "PRODUCT_SUB_LINE_DESCR"
Number of terminal nodes: 2
Residual mean deviance: 0 = 0 / 41140
Misclassification error rate: 0 = 0 / 41146
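(To see the single split explicitly, printing the fitted object lists every node; this is standard tree-package usage, not output from the original post.)

print(tree_model)  # shows the split variable, counts, deviance, and predicted class at each node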
2) I created a new data frame that has only factors with fewer than 22 levels. There is one factor with 25 levels, but tree() does not give an error, so I think the algorithm accepts 25 levels.
> str(new_Dataset)
'data.frame':   51433 obs. of  7 variables:
 $ PRODUCT_SUB_LINE_DESCR: Factor w/ 3 levels "Handpieces","PRIVATE LABEL",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ MAJOR_CATEGORY_DESCR  : Factor w/ 25 levels "AIR ABRASION",..: 23 23 23 23 21 21 21 23 23 23 ...
 $ CUST_REGION_DESCR     : Factor w/ 7 levels "MOUNTAIN WEST REGION",..: 3 6 6 3 5 6 6 2 1 1 ...
 $ Sales                 : num  210 -76.2 275.6 138.7 226 ...
 $ QtySold               : int  3 -1 2 1 2 1 2 1 1 2 ...
 $ MFGCOST               : num  134.6 -44.9 162.5 81.2 136 ...
 $ MarginDollars         : num  72.7 -30.4 109.8 55.8 87.3 ...
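(As a quick check on those levels, a minimal sketch assuming new_Dataset is loaded as above; if I recall correctly, the tree package allows at most 32 levels for an unordered factor predictor, which would explain why the 25-level factor is accepted without error.)

# Count the levels of every factor column in new_Dataset
sapply(new_Dataset, function(x) if (is.factor(x)) nlevels(x) else NA)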
3) Here is how I set up my analysis:

library(tree)  # tree(), cv.tree(), and prune.misclass() come from the tree package

# I chose product name as my main attribute (maybe that is why it appears at
# the root node?)
new_ProductName = factor(ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL", "yes", "no"))  # factor, so tree() does classification
data = data.frame(new_Dataset, new_ProductName)
set.seed(100)
train = sample(1:nrow(data), 0.8 * nrow(data))  # training row indices
training_data = data[train, ]                   # training data
testing_data = data[-train, ]                   # testing data

# fit the tree model using training data
tree_model = tree(new_ProductName ~ ., data = training_data)
summary(tree_model)
plot(tree_model)
text(tree_model, pretty = 0)

out = predict(tree_model)  # class probabilities for the training data
# actuals
input.newproduct = as.character(training_data$new_ProductName)
# predicted: pick the column (class) with the highest probability
pred.newproduct = colnames(out)[max.col(out, ties.method = "first")]
mean(input.newproduct != pred.newproduct)  # misclassification rate

# cross-validation to see how much we need to prune the tree
set.seed(400)
cv_Tree = cv.tree(tree_model, FUN = prune.misclass)  # run cross-validation
plot(cv_Tree)                                        # plot the CV results
plot(cv_Tree$size, cv_Tree$dev, type = "b")

# set best to the size corresponding to the lowest value in the plot above
treePruneMod = prune.misclass(tree_model, best = 9)
plot(treePruneMod)
text(treePruneMod, pretty = 0)

out = predict(treePruneMod)  # class probabilities from the pruned tree on the training data
# predicted
pred.newproduct = colnames(out)[max.col(out, ties.method = "random")]
# calculate misclassification error
mean(training_data$new_ProductName != pred.newproduct)

# predict testing_data with the pruned tree
out = predict(treePruneMod, testing_data, type = "class")
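(For completeness, a minimal sketch of how those test predictions could be scored; table() and mean() are plain base R, nothing beyond what the code above already uses.)

# Confusion matrix and misclassification rate on the held-out set
table(predicted = out, actual = testing_data$new_ProductName)
mean(out != testing_data$new_ProductName)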
4) I have never done this before; I watched a couple of YouTube videos and started from there. I welcome advice, explanation, and criticism, and I would appreciate help through this process. It has been challenging for me.
> table(data$PRODUCT_SUB_LINE_DESCR, data$new_ProductName)

                     no   yes
  Handpieces        164     0
  PRIVATE LABEL       0 14802
  SUNDRY          36467     0
Answer:
In short, trees work by finding the variable that gives the best* split (i.e., the one that best differentiates between the classes) at every node, and a branch terminates once it is pure.
For your problem, the algorithm determines that "PRODUCT_SUB_LINE_DESCR" is the best variable to split on and that it produces pure branches on either side, so no further split is required.
This is due to how you defined your classes, and your intuition is somewhat right:
# I chose product name as my main attribute (maybe that is why it appears at
# the root node?)
new_ProductName = factor(ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL", "yes", "no"))
With the above code/rule, you defined your classes and the best split at the same time. At that point, tree-based classification is equivalent to a simple rule-based classification. Not a good idea.
You should first ponder what you want to achieve. If you want to predict the product name given the other attributes, then drop the "PRODUCT_SUB_LINE_DESCR" column from the data frame after creating your classes (i.e. "new_ProductName"), and then run the tree classification.
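A minimal sketch of that fix, reusing the objects from the question (the column is dropped before the formula ever sees it):

# Remove the column the class label was derived from, then refit the tree
data$PRODUCT_SUB_LINE_DESCR = NULL
training_data = data[train, ]
testing_data  = data[-train, ]
tree_model = tree(new_ProductName ~ ., data = training_data)
summary(tree_model)  # the tree must now split on the remaining predictors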
*Note: the best split is chosen based on a criterion such as information gain or the Gini index.
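For intuition, a toy Gini calculation on the counts from the table in the question: a node's Gini index is 1 - sum(p^2) over its class proportions p, and a pure node scores 0, which is why both branches of the split above stop immediately.

# Gini impurity from class counts: 0 means the node is pure
gini = function(counts) { p = counts / sum(counts); 1 - sum(p^2) }
gini(c(36631, 14802))  # root node ("no" vs. "yes" counts): impure
gini(c(36631, 0))      # non-"PRIVATE LABEL" branch: pure
gini(c(0, 14802))      # "PRIVATE LABEL" branch: pure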