Reputation: 683
I need to develop a model which will be free (or close to free) of false negatives. To do so, I've plotted the precision-recall curve and determined that the threshold value should be set to 0.11.
My question is, how to define threshold value upon model training? There's no point in defining it later upon evaluation because it won't reflect on new data.
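import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# X and y are the feature matrix and binary target (not shown here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
rfc_model = RandomForestClassifier(random_state=101)
rfc_model.fit(X_train, y_train)
rfc_preds = rfc_model.predict(X_test)

# The probabilities do not depend on the threshold, so compute them once
predicted_proba = rfc_model.predict_proba(X_test)

recall_precision_vals = []
for val in np.linspace(0, 1, 101):
    # Label a sample 1 when its class-1 probability reaches the threshold
    predicted = (predicted_proba[:, 1] >= val).astype('int')
    recall_sc = recall_score(y_test, predicted)
    precis_sc = precision_score(y_test, predicted)
    recall_precision_vals.append({
        'Threshold': val,
        'Recall val': recall_sc,
        'Precis val': precis_sc
    })
recall_prec_df = pd.DataFrame(recall_precision_vals)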
Any ideas?
Upvotes: 5
Views: 6644
Reputation: 60321
how to define threshold value upon model training?
There is simply no threshold during model training; Random Forest is a probabilistic classifier, and it only outputs class probabilities. "Hard" classes (i.e. 0/1), which do require a threshold, are neither produced nor used at any stage of model training. They appear only during prediction, and even then only in cases where we actually need a hard classification (which is not always the case). Please see "Predict classes or class probabilities?" for more details.
In fact, the scikit-learn implementation of RF doesn't employ a threshold at all, even for hard class prediction; reading the docs for the predict method closely:
the predicted class is the one with highest mean probability estimate across the trees
In simple words, this means that the actual RF output is [p0, p1] (assuming binary classification), from which the predict method simply returns the class with the highest value, i.e. 0 if p0 > p1 and 1 otherwise.
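A minimal sketch to check this claim on dummy data, assuming a binary problem and default settings (the variable names argmax_preds and same_as_predict are only for illustration): taking the argmax over the predict_proba columns and mapping it through classes_ should reproduce what predict returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Dummy binary data, for illustration only
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)

# One row per sample; the columns follow the order of clf.classes_
proba = clf.predict_proba(X)

# Pick the class with the highest mean probability, with no threshold involved
argmax_preds = clf.classes_[np.argmax(proba, axis=1)]

# Compare against the hard predictions returned by predict
same_as_predict = np.array_equal(argmax_preds, clf.predict(X))
print(same_as_predict)
```

If the two arrays match, no separate cutoff is being applied anywhere inside predict; the "0.5" is just a consequence of picking the larger of two probabilities.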
Assuming that what you actually want to do is return 1 if p1 is greater than some threshold less than 0.5, you have to ditch predict, use predict_proba instead, and then manipulate the returned probabilities to get what you want. Here is an example with dummy data:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, max_depth=2,
                             random_state=0)
clf.fit(X, y)
Here, simply using predict for, say, the first element of X, will give 0:
clf.predict(X)[0]
# 0
because
clf.predict_proba(X)[0]
# array([0.85266881, 0.14733119])
i.e. p0 > p1.
To get what you want (i.e. here returning class 1, since p1 > threshold for a threshold of 0.11), here is what you have to do:
prob_preds = clf.predict_proba(X)
threshold = 0.11 # define threshold here
preds = [1 if p[1] > threshold else 0 for p in prob_preds]
After this, it is easy to see that for the first predicted sample we now have:
preds[0]
# 1
since, as shown above, for this sample we have p1 = 0.14733119 > threshold.
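Since the threshold is applied only at prediction time, reusing it on new data just means applying the same rule to the probabilities of the new samples. Here is a minimal sketch of that idea; the helper name predict_with_threshold, the variable names, and the train/new split are assumptions made for illustration and are not part of scikit-learn or of the original question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split


def predict_with_threshold(model, X, threshold=0.5):
    """Hypothetical helper: return 1 where P(class 1) > threshold, else 0."""
    p1 = model.predict_proba(X)[:, 1]  # column 1 corresponds to model.classes_[1]
    return (p1 > threshold).astype(int)


# Dummy binary data; X_new stands in for data the model has not seen
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)  # training itself never sees the threshold

default_preds = clf.predict(X_new)                                   # argmax rule
low_thr_preds = predict_with_threshold(clf, X_new, threshold=0.11)   # custom rule

# Lowering the threshold can only add positives, so recall cannot decrease
recall_default = recall_score(y_new, default_preds)
recall_low_thr = recall_score(y_new, low_thr_preds)
print(recall_default, recall_low_thr)
```

The vectorized comparison gives the same labels as the list comprehension above, only as a NumPy array. Keep in mind that a lower threshold trades precision for recall, so the value 0.11 should ideally be chosen on data that is separate from the final test set.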
Upvotes: 12