Reputation: 532
I'm trying to use Random Forest to predict the outcome of an extremely imbalanced data set (the rate of 1's is only about 1%, or even less). Because the traditional randomForest minimizes the overall error rate rather than paying special attention to the positive class, it is not directly applicable to imbalanced data. So I want to assign a high cost to misclassification of the minority class (cost-sensitive learning).
I have read in several sources that we can use the classwt
option of randomForest
in R, but I don't know how to use it. Do we have any other alternatives to the randomForest
function?
Upvotes: 0
Views: 3718
Reputation: 353
classwt
gives you the ability to assign a prior probability to each of the classes in your dataset. So, if you have classwt = c(0.5, 0.5)
, then you are saying that, before actually running the model on your specific dataset, you expect there to be around the same number of 0's as 1's. You can adjust these priors to try to minimize false negatives. Assigning a cost this way may seem like a smart idea in theory, but in practice it does not work so well: the prior probabilities tend to affect the algorithm more sharply than desired. Still, you could experiment with this.
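As a minimal sketch of how classwt is passed, assuming a data frame df with a binary factor outcome y (these names are placeholders, not from the original post):

```r
library(randomForest)

# classwt takes one prior probability per class, in the order of levels(df$y).
rf_weighted <- randomForest(
  y ~ ., data = df,
  classwt = c(0.5, 0.5),  # equal priors despite the imbalance in df
  ntree = 500
)
```

Nudging the prior of the minority class upward (e.g. `classwt = c(0.3, 0.7)`) pushes the forest toward predicting that class more often, at the cost of more false positives.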
An alternative solution is to run the regular random forest, and then for a prediction, use the type='prob'
option in the predict()
command. For instance, for a random forest rf1
, where we are trying to predict the results of a dataset data1
, we could do:
predictions <- predict(rf1, newdata=data1, type='prob')
Then, you can choose your own probability threshold for classifying the observations in your data. A nice way to view candidate thresholds graphically is the ROCR package, which generates receiver operating characteristic (ROC) curves.
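Putting both steps together, here is one possible sketch (rf1, data1, and the outcome column data1$y are placeholders carried over from the example above):

```r
library(ROCR)

# Probability of the positive class (second column of the prob matrix)
probs <- predict(rf1, newdata = data1, type = 'prob')[, 2]

# Classify with a custom threshold, e.g. well below the default 0.5
# to catch more of the rare positives
threshold <- 0.1
pred_class <- ifelse(probs > threshold, 1, 0)

# ROC curve: true positive rate vs. false positive rate across thresholds
pred_obj <- prediction(probs, data1$y)
perf <- performance(pred_obj, measure = 'tpr', x.measure = 'fpr')
plot(perf)
```

Scanning along the plotted curve shows the trade-off each threshold buys you, which is usually more informative for rare-event problems than the raw error rate.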
Upvotes: 3