Reputation: 334
I am trying to fit a random forest classifier on an imbalanced dataset using the scikit-learn Python library.
My goal is to obtain more or less the same value for recall and precision, and to do so I am using the class_weight parameter of RandomForestClassifier.
When fitting the Random Forest with class_weight = {0:1, 1:1} (in other words, treating the dataset as if it were not imbalanced), I obtain:
Accuracy: 0.79 Precision: 0.63 Recall: 0.32 AUC: 0.74
When I change the class_weight to {0:1, 1:10}, I obtain:
Accuracy: 0.79 Precision: 0.65 Recall: 0.29 AUC: 0.74
So the recall and precision values have barely changed (even if I increase the weight from 10 to 100, the changes are minimal).
Since X_train and X_test are both imbalanced in the same proportions (the dataset has more than 1 million rows), shouldn't I obtain very different recall and precision values when using class_weight = {0:1, 1:10}?
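For reference, here is a minimal sketch of the setup being described (the question does not show its data or code, so make_classification stands in for the real dataset):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the imbalanced dataset (~10% positives)
X, y = make_classification(n_samples=100_000, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

for weights in ({0: 1, 1: 1}, {0: 1, 1: 10}):
    clf = RandomForestClassifier(class_weight=weights, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]
    print(weights,
          "Accuracy:", round(accuracy_score(y_test, y_pred), 2),
          "Precision:", round(precision_score(y_test, y_pred), 2),
          "Recall:", round(recall_score(y_test, y_pred), 2),
          "AUC:", round(roc_auc_score(y_test, y_prob), 2))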
Upvotes: 1
Views: 495
Reputation: 1984
As a complementary answer, you can also try to optimize your model towards one or more metrics. You can use RandomizedSearchCV to look for a good combination of hyperparameters for you. For instance, if you are training a Random Forest classifier:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Base model
MOD = RandomForestClassifier()

# Parameter distributions for RandomizedSearchCV
m_params = {
    "RF": {
        "n_estimators": np.linspace(2, 500, 500, dtype="int"),
        "max_depth": [5, 20, 30, None],
        "min_samples_split": np.linspace(2, 50, 50, dtype="int"),
        "max_features": ["sqrt", "log2", 10, 20, None],
        "oob_score": [True],
        "bootstrap": [True]
    },
}

# Score each candidate on both recall and precision, and refit the
# final model on the metric we care most about (recall here)
scoreFunction = {"recall": "recall", "precision": "precision"}
random_search = RandomizedSearchCV(MOD,
                                   param_distributions=m_params["RF"],
                                   n_iter=20,
                                   scoring=scoreFunction,
                                   refit="recall",
                                   return_train_score=True,
                                   random_state=42,
                                   cv=5,
                                   verbose=1)

# Train and optimize the model
random_search.fit(x_train, y_train)

# Recover the best model
MOD = random_search.best_estimator_
Note that the scoring and refit parameters tell RandomizedSearchCV which metrics you are most interested in maximizing. This method will also save you the time of hand-tuning (and potentially overfitting your model on your test data).
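If you want to compare how the two metrics traded off across the sampled configurations, the fitted search object exposes one mean_test_<name> column per scorer (a quick sketch; the column names follow the keys given in scoring):
import pandas as pd

# Every scorer passed via `scoring` gets its own mean_test_<name> key
results = pd.DataFrame(random_search.cv_results_)
print(results[["mean_test_recall", "mean_test_precision", "params"]]
      .sort_values("mean_test_recall", ascending=False)
      .head())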
Good luck!
Upvotes: 0
Reputation: 1514
If you want to increase the recall of your model, there is a much faster way of doing so.
You can compute the precision-recall curve using sklearn.
This curve will give you the trade-off between precision and recall for your model.
This means that if you want to increase the recall of your model, you can ask the random forest for the predicted probabilities of each class, add 0.1 to the probability of class 1, and subtract 0.1 from the probability of class 0. This will effectively increase your recall.
If you plot the precision-recall curve, you will be able to find the threshold at which precision and recall are equal.
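As a rough sketch of that idea (assuming clf is the already-fitted random forest and X_test, y_test the held-out data), shifting both probabilities by 0.1 is equivalent to lowering the decision threshold on class 1 from 0.5 to 0.4:
from sklearn.metrics import precision_score, recall_score

# Probability of class 1 from the fitted forest
y_prob = clf.predict_proba(X_test)[:, 1]

# The default predict() is equivalent to a 0.5 threshold
y_pred_default = (y_prob >= 0.5).astype(int)

# Lowering the threshold by 0.1 trades precision for recall
y_pred_shifted = (y_prob >= 0.4).astype(int)

for name, y_pred in [("threshold 0.5", y_pred_default),
                     ("threshold 0.4", y_pred_shifted)]:
    print(name,
          "precision:", round(precision_score(y_test, y_pred), 2),
          "recall:", round(recall_score(y_test, y_pred), 2))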
Here is the example from the sklearn documentation:
from inspect import signature
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm, datasets
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Limit to the two first classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(X[y < 2], y[y < 2],
                                                    test_size=.5,
                                                    random_state=random_state)

# Create a simple classifier
classifier = svm.LinearSVC(random_state=random_state)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

precision, recall, _ = precision_recall_curve(y_test, y_score)

# In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
step_kwargs = ({'step': 'post'}
               if 'step' in signature(plt.fill_between).parameters
               else {})
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.show()
This should give you a step plot of the precision-recall curve.
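To read off the threshold at which precision and recall are equal, one option is to pick the point on the curve where the two values are closest (a sketch reusing y_test and y_score from the code above):
precision, recall, thresholds = precision_recall_curve(y_test, y_score)

# precision_recall_curve returns len(thresholds) + 1 precision/recall
# values, so drop the final point before comparing against thresholds
gap = np.abs(precision[:-1] - recall[:-1])
best = np.argmin(gap)
print("Threshold with precision ~= recall:", thresholds[best])
print("Precision:", precision[best], "Recall:", recall[best])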
Upvotes: 2