Reputation: 197
I have a data frame with 10,000 rows and two columns: segment (a factor with 32 levels) and target (a factor with two levels, 'yes' and 'no', 5,000 of each). I am trying to use a random forest to classify target using segment as a feature.
After training the random forest classifier:
> forest <- randomForest(target ~ segment, data)
The confusion matrix is strongly biased toward 'no':
> print(forest$confusion)
       no  yes class.error
no   4872   76  0.01535974
yes  5033   19  0.99623911
Out of the 10,000 rows, fewer than 100 are classified as 'yes' (even though the original counts are 50/50). If I swap the names of the labels, I get the opposite result:
> data$target <- as.factor(ifelse(data$target == 'yes', 'no', 'yes'))
> forest <- randomForest(target ~ segment, data = data)
> print(forest$confusion)
       no  yes class.error
no   4915  137  0.02711797
yes  4810  138  0.97210994
So this is not a real signal. Furthermore, the original cross-table is relatively balanced:
> table(data$target, data$segment)
       1  10  11  12  13  14  15  16  17  18  19   2  20  21  22  23  24  25  26  27  28  29   3  30  31  32   4   5   6   7   8   9
no  1074 113 121  86  68 165 210  70 120 127 101 132  90 108 171 122  95  95  76  72 105  71 234  58  83  72 290 162 262 192  64 139
yes 1114 105 136 120  73 201 209  78 130 124  90 145  81 104 155 128  79  85  83  70  93  78 266  70  93  76 291 160 235 194  49 137
It looks like randomForest takes the first label and almost always assigns points to it. To clarify, this data frame is a subset of a larger table with more features; I found that this specific feature leads to this result no matter how many other features are included. I am wondering whether I am missing something basic about the random forest classifier, or whether some encoding issue or other bug leads to this weird result.
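For completeness, one way to check whether the level order alone (rather than the labels) is responsible would be to reverse the levels without renaming them, e.g.:
> data$target <- factor(data$target, levels = c('yes', 'no'))  # same labels, reversed level order
> forest <- randomForest(target ~ segment, data = data)
> print(forest$confusion)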
The original dataset is available as an RDS here:
https://www.dropbox.com/s/rjq6lmvd78d6aot/weird_random_forest.RDS?dl=0
Upvotes: 1
Views: 1118
Reputation: 311
I believe the reason randomForest is almost always choosing 'no' when segment is a factor is that randomForest produces distorted error rates, sensitivity, and specificity whenever there is any inequality in the outcome class sizes. So while your data are 'relatively' balanced, they are not exactly balanced, and whichever outcome class is more prevalent in the dataset will be strongly favored in prediction. If you instead send exactly balanced data to randomForest() when there is no true relationship between predictor and outcome, you get random fluctuation in the predicted class from run to run.
See Malley, J. D., Malley, K. G., and Pajevic, S. (2011). Statistical Learning for Biomedical Data. Cambridge University Press, for a more complete discussion of data balancing when using randomForest classification.
library(randomForest)

# create dataset balanced on outcome, random predictor values
# (target is created as a factor so randomForest treats this as classification)
data <- data.frame(target = factor(rep(c("yes", "no"), each = 50)),
                   segment = factor(sample(1:5, 100, replace = TRUE)))
table(data$target, data$segment)
table(data$target)
forest_run1 <- randomForest(target ~ segment, data = data)
#OOB estimate of error rate: 46%
#Confusion matrix:
#    no yes class.error
#no  25  25        0.50
#yes 21  29        0.42
forest_run2 <- randomForest(target ~ segment, data = data)
#OOB estimate of error rate: 53%
#Confusion matrix:
#    no yes class.error
#no  25  25        0.50
#yes 28  22        0.56
forest_run3 <- randomForest(target ~ segment, data = data)
#OOB estimate of error rate: 47%
#Confusion matrix:
#    no yes class.error
#no  25  25        0.50
#yes 22  28        0.44
# COMPARE THIS TO UNBALANCED RESULTS, WHERE THE MORE PREVALENT CLASS IS ALMOST ALWAYS CHOSEN
# create dataset, unbalanced on outcome, random predictor values
# (note: outcome and predictor must be the same length, so both use n = 100)
data1 <- data.frame(target = factor(sample(c("yes", "no"), 100, replace = TRUE, prob = c(0.6, 0.4))),
                    segment = factor(sample(1:5, 100, replace = TRUE)))
table(data1$target, data1$segment)
table(data1$target)
forest1 <- randomForest(target ~ segment, data = data1)
#OOB estimate of error rate: 38%
#Confusion matrix:
#    no yes class.error
#no  14  30   0.6818182
#yes  8  48   0.1428571
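If you do have to work with unbalanced data, one remedy supported by randomForest() is to force balanced in-bag samples through its strata and sampsize arguments. A minimal sketch on the unbalanced data1 above (n_min and forest_bal are just scratch names):
# draw an equal number of in-bag cases from each outcome class,
# so the more prevalent class no longer dominates the votes
n_min <- min(table(data1$target))
forest_bal <- randomForest(target ~ segment, data = data1,
                           strata = data1$target,
                           sampsize = c(n_min, n_min))
print(forest_bal$confusion)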
Upvotes: 1
Reputation: 93871
Your data frame is balanced in the sense that "yes" and "no" are about equally likely overall. However, the value of segment contains essentially no information about the value of target, in the sense that "yes" and "no" are about equally likely at every level of segment, so there is no reason to expect good predictions from a random forest or any other procedure.
If you convert segment to numeric, then randomForest predicts "yes" about 65% of the time. About 63% of the data fall in values of segment where "yes" is (slightly) more probable than "no", which may explain the high rate of "yes" predictions when segment is numeric. But whether segment is numeric or a factor, the overall error rate is about the same. I'm not sure why randomForest is almost always choosing "no" when segment is a factor.
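Both proportions are easy to check directly. A sketch, assuming the data frame from the question's RDS is loaded as data and that the levels of segment are the numerals shown in the cross-table (forest_num and tab are scratch names):
# refit with segment as numeric and look at the out-of-bag predictions
data$segment_num <- as.numeric(as.character(data$segment))
forest_num <- randomForest(target ~ segment_num, data = data)
mean(predict(forest_num) == "yes")  # share of "yes" among OOB predictions

# share of rows sitting in segments where "yes" outnumbers "no"
tab <- table(data$segment, data$target)
sum(tab[tab[, "yes"] > tab[, "no"], ]) / sum(tab)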
Upvotes: 1