Reputation: 197
I have a data frame with 10,000 rows and two columns: segment (a factor with 32 levels) and target (a factor with two levels, 'yes' and 'no', 5,000 of each). I am trying to use a random forest to classify target using segment as a feature.
After training the random forest classifier:
> forest <- randomForest(target ~ segment, data)
The confusion matrix is strongly biased toward 'no':
> print(forest$confusion)
       no  yes class.error
no   4872   76  0.01535974
yes  5033   19  0.99623911
Out of the 10,000 rows, fewer than 100 are classified as 'yes' (even though the original counts are 50/50). If I swap the names of the labels, I get the opposite result:
> data$target <- as.factor(ifelse(data$target == 'yes', 'no', 'yes'))
> forest <- randomForest(target ~ segment, data = data)
> print(forest$confusion)
       no  yes class.error
no   4915  137  0.02711797
yes  4810  138  0.97210994
So this is not a real signal. Furthermore, the original cross-table is relatively balanced:
> table(data$target, data$segment)
       1  10  11  12  13  14  15  16  17  18  19   2  20  21  22  23  24  25  26  27  28  29   3  30  31  32   4   5   6   7   8   9
no  1074 113 121  86  68 165 210  70 120 127 101 132  90 108 171 122  95  95  76  72 105  71 234  58  83  72 290 162 262 192  64 139
yes 1114 105 136 120  73 201 209  78 130 124  90 145  81 104 155 128  79  85  83  70  93  78 266  70  93  76 291 160 235 194  49 137
It looks like randomForest takes the first label and almost always assigns points to it. To clarify, this data frame is a subset of a larger table with more features; I found that this specific feature leads to this result no matter how many other features are included. I am wondering whether I am missing something basic about the random forest classifier, or whether some encoding issue or other bug leads to this weird result.
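For completeness, one way to check whether the level order alone (rather than the labels) is responsible would be to reverse the levels without renaming them, e.g.:
> data$target <- factor(data$target, levels = c('yes', 'no'))  # same labels, reversed level order
> forest <- randomForest(target ~ segment, data = data)
> print(forest$confusion)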
The original dataset is available as an RDS here:
https://www.dropbox.com/s/rjq6lmvd78d6aot/weird_random_forest.RDS?dl=0
Upvotes: 1
Views: 1118
Reputation: 311
I believe the reason randomForest is almost always choosing 'no' when segment is a factor is that randomForest produces distorted error rates, sensitivity, and specificity whenever there is any inequality in the outcome class sizes. So while your data are 'relatively' balanced, they are not exactly balanced, and whichever outcome class is more prevalent in the dataset will be strongly favored in prediction. If you instead send exactly balanced data to randomForest() when there is no true relationship between predictor and outcome, you get random fluctuation in the predicted class from run to run.
See Malley, J. D., Malley, K. G., and Pajevic, S. (2011). Statistical Learning for Biomedical Data. Cambridge University Press, for a more complete discussion of data balancing when using randomForest classification.
library(randomForest)

# create dataset balanced on outcome, random predictor values
# (target is created as a factor so randomForest treats this as classification)
data <- data.frame(target = factor(rep(c("yes", "no"), each = 50)),
                   segment = factor(sample(1:5, 100, replace = TRUE)))
table(data$target, data$segment)
table(data$target)
forest_run1 <- randomForest(target ~ segment, data = data)
#OOB estimate of error rate: 46%
#Confusion matrix:
#    no yes class.error
#no  25  25        0.50
#yes 21  29        0.42
forest_run2 <- randomForest(target ~ segment, data = data)
#OOB estimate of error rate: 53%
#Confusion matrix:
#    no yes class.error
#no  25  25        0.50
#yes 28  22        0.56
forest_run3 <- randomForest(target ~ segment, data = data)
#OOB estimate of error rate: 47%
#Confusion matrix:
#    no yes class.error
#no  25  25        0.50
#yes 22  28        0.44
# COMPARE THIS TO UNBALANCED RESULTS, WHERE THE MORE PREVALENT CLASS IS ALMOST ALWAYS CHOSEN
# create dataset, unbalanced on outcome, random predictor values
# (note: outcome and predictor must be the same length, so both use n = 100)
data1 <- data.frame(target = factor(sample(c("yes", "no"), 100, replace = TRUE, prob = c(0.6, 0.4))),
                    segment = factor(sample(1:5, 100, replace = TRUE)))
table(data1$target, data1$segment)
table(data1$target)
forest1 <- randomForest(target ~ segment, data = data1)
#OOB estimate of error rate: 38%
#Confusion matrix:
#    no yes class.error
#no  14  30   0.6818182
#yes  8  48   0.1428571
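If you do have to work with unbalanced data, one remedy supported by randomForest() is to force balanced in-bag samples through its strata and sampsize arguments. A minimal sketch on the unbalanced data1 above (n_min and forest_bal are just scratch names):
# draw an equal number of in-bag cases from each outcome class,
# so the more prevalent class no longer dominates the votes
n_min <- min(table(data1$target))
forest_bal <- randomForest(target ~ segment, data = data1,
                           strata = data1$target,
                           sampsize = c(n_min, n_min))
print(forest_bal$confusion)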
Upvotes: 1
Reputation: 93871
Your data frame is balanced in the sense that "yes" and "no" are about equally likely overall. However, the value of segment contains essentially no information about the value of target, in the sense that "yes" and "no" are about equally likely at every level of segment, so there is no reason to expect good predictions from a random forest or any other procedure.
If you convert segment to numeric, then randomForest predicts "yes" about 65% of the time. About 63% of the data fall in values of segment where "yes" is (slightly) more probable than "no", which may explain the high rate of "yes" predictions when segment is numeric. But whether segment is numeric or a factor, the overall error rate is about the same. I'm not sure why randomForest is almost always choosing "no" when segment is a factor.
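Both proportions are easy to check directly. A sketch, assuming the data frame from the question's RDS is loaded as data and that the levels of segment are the numerals shown in the cross-table (forest_num and tab are scratch names):
# refit with segment as numeric and look at the out-of-bag predictions
data$segment_num <- as.numeric(as.character(data$segment))
forest_num <- randomForest(target ~ segment_num, data = data)
mean(predict(forest_num) == "yes")  # share of "yes" among OOB predictions

# share of rows sitting in segments where "yes" outnumbers "no"
tab <- table(data$segment, data$target)
sum(tab[tab[, "yes"] > tab[, "no"], ]) / sum(tab)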
Upvotes: 1