Nahkki

Reputation: 872

How to weight classes in a RandomForest implementation?

I am working on 3D point identification using the RandomForest method from scikit-learn. One of the issues I keep running into is that certain classes are present more often than other classes.

This means that when generating predictions from the trained classifier, if the classifier is uncertain of a point's class, it is more likely to assume the point belongs to one of the common classes rather than a less common one.

I see that in the scikit-learn documentation for random forests there is a sample_weight parameter in the fit method. From what I can tell, that weights overall samples (say I have 50 files I am training from, it will weight the first sample twice as heavily as everything else) rather than classes.

This doesn't fix the issue, because the least common classes are about equally rare across all the samples I have. It's just the nature of those particular classes.

I've found some papers on balanced random forests and weighted random forests, but I haven't seen anything about how to use them in scikit-learn. I'm hoping I'm wrong - is there a built-in way to weight classes? Should I write something separate that artificially evens out the weight of different classes in my samples?

sample_weight, according to the documentation, seems to refer to samples and not class weights. So if I have files A, B and C and classes 1, 2 and 3, let's say:

A = [1 1 1 2]
B = [2 2 1 1]
C = [3 1 1 1]

Looking above, we have a very simplified situation in which there are very few points of class 3 compared to the other classes. My actual situation has 8 classes and is training on millions of points, but the ratio is still incredibly skewed against two particular classes.

Using sample_weight, which takes an array of size m (m being the number of samples), I would be able to weight how heavily each of those three files counts. So my understanding is that I can do sample_weight = [1 1 2], which would make sample C count twice as much as the other two samples.

However, this doesn't really help, because my issue is that class 3 is super rare (in the actual data it's 1k points out of millions rather than 1 out of 12).

Increasing the weight of any given sample won't increase the weight of a particular class unless I fake some data in which the sample is composed of almost nothing but that class.
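For illustration, here is a minimal sketch of the per-point alternative I have in mind. Since fit's sample_weight is really one weight per training point, I could compute a weight for each point that is inversely proportional to its class frequency (the toy X and y below are made up):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# flat array of class labels, one per training point (toy stand-in)
y = np.array([1, 1, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1])
X = np.random.rand(len(y), 3)  # placeholder 3D features

# weight each point by the inverse frequency of its class
class_counts = np.bincount(y)
weights = 1.0 / class_counts[y]

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y, sample_weight=weights)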

I found sklearn.preprocessing.balance_weights(y) in the documentation, but I can't find anyone using it. In theory it does what I need it to do, but I don't see how to feed the resulting weights array back into my random forest.
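From what I can tell (this may depend on the scikit-learn version), balance_weights was later replaced by sklearn.utils.class_weight.compute_sample_weight, which produces exactly this kind of per-point array. A sketch, reusing the X and y from above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# 'balanced' weights each point inversely to its class frequency
weights = compute_sample_weight(class_weight="balanced", y=y)

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y, sample_weight=weights)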

Upvotes: 7

Views: 8607

Answers (2)

Aroca

Reputation: 11

I wonder if it would give better results to use "balanced_subsample" instead of "balanced":

rf = RandomForestClassifier(class_weight="balanced_subsample")

This option computes the weights dynamically based on the bootstrap sample taken to build each tree, so it adjusts the weights inside each sample set. If we consider that each bootstrap sample could be unbalanced in a different way, I would say this option should be the best one. Try this.
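A minimal sketch of how that could look end to end, with made-up toy data in which class 3 is rare:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy imbalanced data: class 3 is deliberately rare
rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
y = rng.choice([1, 2, 3], size=1000, p=[0.55, 0.40, 0.05])

# weights are recomputed from each tree's own bootstrap sample
rf = RandomForestClassifier(class_weight="balanced_subsample")
rf.fit(X, y)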

Upvotes: 0

David Maust

Reputation: 8270

I'm guessing this only applies to newer versions of scikit-learn, but you can now use this:

rf = RandomForestClassifier(class_weight="balanced")

Upvotes: 2
