DaxHR

Reputation: 683

Change threshold value for Random Forest classifier

I need to develop a model that is free (or close to free) of false negatives. To do so, I've plotted a recall-precision curve and determined that the threshold value should be set to 0.11.

My question is: how do I set the threshold value during model training? There's no point in defining it only at evaluation time, because it wouldn't carry over to new data.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

rfc_model = RandomForestClassifier(random_state=101)
rfc_model.fit(X_train, y_train)
rfc_preds = rfc_model.predict(X_test)

# compute the probabilities once, outside the loop
predicted_proba = rfc_model.predict_proba(X_test)

recall_precision_vals = []

for val in np.linspace(0, 1, 101):
    predicted = (predicted_proba[:, 1] >= val).astype('int')

    recall_sc = recall_score(y_test, predicted)
    precis_sc = precision_score(y_test, predicted)

    recall_precision_vals.append({
        'Threshold': val,
        'Recall val': recall_sc,
        'Precis val': precis_sc
    })

recall_prec_df = pd.DataFrame(recall_precision_vals)
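For reference, scikit-learn's precision_recall_curve can build the same table in one call. A minimal sketch with dummy data (since my actual X and y aren't shown here):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# dummy data standing in for the real X and y
X, y = make_classification(n_samples=500, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

rfc_model = RandomForestClassifier(random_state=101).fit(X_train, y_train)
probs = rfc_model.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, probs)
# thresholds has one fewer entry than precision/recall,
# so drop the last precision/recall value when tabulating
curve_df = pd.DataFrame({
    'Threshold': thresholds,
    'Recall val': recall[:-1],
    'Precis val': precision[:-1],
})
```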

Any ideas?

Upvotes: 5

Views: 6644

Answers (1)

desertnaut

Reputation: 60321

how to define threshold value upon model training?

There is simply no threshold during model training; Random Forest is a probabilistic classifier, and it only outputs class probabilities. "Hard" classes (i.e. 0/1), which indeed require a threshold, are neither produced nor used at any stage of model training, only during prediction, and even then only when we actually require a hard classification (which is not always the case). Please see Predict classes or class probabilities? for more details.

In fact, the scikit-learn implementation of RF doesn't employ a threshold at all, even for hard class prediction; reading the docs for the predict method closely:

the predicted class is the one with highest mean probability estimate across the trees

In simple words, this means that the actual RF output is [p0, p1] (assuming binary classification), from which the predict method simply returns the class with the highest value, i.e. 0 if p0 > p1 and 1 otherwise.
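This argmax behavior is easy to verify; a quick check with dummy data (not the OP's, and for binary 0/1 classes specifically, where the argmax coincides with an implicit 0.5 threshold):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(X)
# predict() returns exactly the class with the highest mean probability
assert np.array_equal(clf.predict(X), proba.argmax(axis=1))
```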

Assuming that what you actually want is to return 1 if p1 is greater than some threshold less than 0.5, you have to ditch predict, use predict_proba instead, and then manipulate the returned probabilities to get what you want. Here is an example with dummy data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)

clf = RandomForestClassifier(n_estimators=100, max_depth=2,
                             random_state=0)

clf.fit(X, y)

Here, simply using predict for, say, the first element of X, will give 0:

clf.predict(X)[0] 
# 0

because

clf.predict_proba(X)[0]
# array([0.85266881, 0.14733119])

i.e. p0 > p1.

To get what you want (i.e. here returning class 1, since p1 > threshold for a threshold of 0.11), here is what you have to do:

prob_preds = clf.predict_proba(X)
threshold = 0.11  # define threshold here
preds = [1 if prob_preds[i][1] > threshold else 0 for i in range(len(prob_preds))]

after which, it is easy to see that now for the first predicted sample we have:

preds[0]
# 1

since, as shown above, for this sample we have p1 = 0.14733119 > threshold.
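As a side note, the list comprehension above can also be written as a vectorized NumPy expression, which is faster on large arrays and gives the same labels. A sketch with hard-coded dummy probabilities in place of the classifier's output:

```python
import numpy as np

# dummy [p0, p1] rows standing in for clf.predict_proba(X)
prob_preds = np.array([[0.85, 0.15],
                       [0.60, 0.40],
                       [0.95, 0.05]])
threshold = 0.11

# class 1 whenever p1 exceeds the threshold, else class 0
preds = (prob_preds[:, 1] > threshold).astype(int)
# preds is [1, 1, 0]: only the last sample has p1 <= 0.11
```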

Upvotes: 12
