Reputation: 334
I am trying to fit a random forest classifier on an imbalanced dataset using the scikit-learn Python library.
My goal is to obtain more or less the same value for recall and precision, and to do so I am using the class_weight parameter of RandomForestClassifier.
When fitting the Random Forest with class_weight = {0:1, 1:1} (in other words, treating the dataset as if it were not imbalanced), I obtain:
Accuracy: 0.79 Precision: 0.63 Recall: 0.32 AUC: 0.74
When I change the class_weight to {0:1, 1:10}, I obtain:
Accuracy: 0.79 Precision: 0.65 Recall: 0.29 AUC: 0.74
So the recall and precision values have barely changed (even if I increase the weight from 10 to 100, the changes are minimal).
Since X_train and X_test are both imbalanced in the same proportions (the dataset has more than 1 million rows), shouldn't I obtain very different recall and precision values when using class_weight = {0:1, 1:10}?
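For reference, here is a minimal sketch of the setup being described (the question does not show its data or code, so make_classification stands in for the real dataset):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the imbalanced dataset (~10% positives)
X, y = make_classification(n_samples=100_000, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

for weights in ({0: 1, 1: 1}, {0: 1, 1: 10}):
    clf = RandomForestClassifier(class_weight=weights, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]
    print(weights,
          "Accuracy:", round(accuracy_score(y_test, y_pred), 2),
          "Precision:", round(precision_score(y_test, y_pred), 2),
          "Recall:", round(recall_score(y_test, y_pred), 2),
          "AUC:", round(roc_auc_score(y_test, y_prob), 2))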
Upvotes: 1
Views: 495
Reputation: 1984
As a complementary answer, you can also try to optimize your model towards one or more metrics. You can use RandomizedSearchCV to look for a good combination of hyperparameters for you. For instance, if you are training a Random Forest classifier:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Base model
MOD = RandomForestClassifier()

# Parameter distributions for RandomizedSearchCV
m_params = {
    "RF": {
        "n_estimators": np.linspace(2, 500, 500, dtype="int"),
        "max_depth": [5, 20, 30, None],
        "min_samples_split": np.linspace(2, 50, 50, dtype="int"),
        "max_features": ["sqrt", "log2", 10, 20, None],
        "oob_score": [True],
        "bootstrap": [True]
    },
}

# Score each candidate on both recall and precision, and refit the
# final model on the metric we care most about (recall here)
scoreFunction = {"recall": "recall", "precision": "precision"}
random_search = RandomizedSearchCV(MOD,
                                   param_distributions=m_params["RF"],
                                   n_iter=20,
                                   scoring=scoreFunction,
                                   refit="recall",
                                   return_train_score=True,
                                   random_state=42,
                                   cv=5,
                                   verbose=1)

# Train and optimize the model
random_search.fit(x_train, y_train)

# Recover the best model
MOD = random_search.best_estimator_
Note that the scoring and refit parameters tell RandomizedSearchCV which metrics you are most interested in maximizing. This method will also save you the time of hand-tuning (and potentially overfitting your model on your test data).
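If you want to compare how the two metrics traded off across the sampled configurations, the fitted search object exposes one mean_test_<name> column per scorer (a quick sketch; the column names follow the keys given in scoring):
import pandas as pd

# Every scorer passed via `scoring` gets its own mean_test_<name> key
results = pd.DataFrame(random_search.cv_results_)
print(results[["mean_test_recall", "mean_test_precision", "params"]]
      .sort_values("mean_test_recall", ascending=False)
      .head())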
Good luck!
Upvotes: 0
Reputation: 1514
If you want to increase the recall of your model, there is a much faster way of doing so.
You can compute the precision-recall curve using sklearn.
This curve will give you the trade-off between precision and recall for your model.
This means that if you want to increase the recall of your model, you can ask the random forest for the predicted probabilities of each class, add 0.1 to the probability of class 1, and subtract 0.1 from the probability of class 0. This will effectively increase your recall.
If you plot the precision-recall curve, you will be able to find the threshold at which precision and recall are equal.
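As a rough sketch of that idea (assuming clf is the already-fitted random forest and X_test, y_test the held-out data), shifting both probabilities by 0.1 is equivalent to lowering the decision threshold on class 1 from 0.5 to 0.4:
from sklearn.metrics import precision_score, recall_score

# Probability of class 1 from the fitted forest
y_prob = clf.predict_proba(X_test)[:, 1]

# The default predict() is equivalent to a 0.5 threshold
y_pred_default = (y_prob >= 0.5).astype(int)

# Lowering the threshold by 0.1 trades precision for recall
y_pred_shifted = (y_prob >= 0.4).astype(int)

for name, y_pred in [("threshold 0.5", y_pred_default),
                     ("threshold 0.4", y_pred_shifted)]:
    print(name,
          "precision:", round(precision_score(y_test, y_pred), 2),
          "recall:", round(recall_score(y_test, y_pred), 2))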
Here is the example from the sklearn documentation:
from inspect import signature
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm, datasets
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Limit to the two first classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(X[y < 2], y[y < 2],
                                                    test_size=.5,
                                                    random_state=random_state)

# Create a simple classifier
classifier = svm.LinearSVC(random_state=random_state)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

precision, recall, _ = precision_recall_curve(y_test, y_score)

# In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
step_kwargs = ({'step': 'post'}
               if 'step' in signature(plt.fill_between).parameters
               else {})
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.show()
This should give you a step plot of the precision-recall curve.
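To read off the threshold at which precision and recall are equal, one option is to pick the point on the curve where the two values are closest (a sketch reusing y_test and y_score from the code above):
precision, recall, thresholds = precision_recall_curve(y_test, y_score)

# precision_recall_curve returns len(thresholds) + 1 precision/recall
# values, so drop the final point before comparing against thresholds
gap = np.abs(precision[:-1] - recall[:-1])
best = np.argmin(gap)
print("Threshold with precision ~= recall:", thresholds[best])
print("Precision:", precision[best], "Recall:", recall[best])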
Upvotes: 2