Reputation: 15777
With a supervised learning method, we have features (inputs) and targets (outputs). If we have multi-dimensional targets whose rows sum to 1 (e.g. [0.3, 0.4, 0.3]), why do the predictions from sklearn's RandomForestRegressor also seem to sum to 1 whenever the training targets sum to 1?
It seems like somewhere in the source code of sklearn the outputs are being normalized if the training data sums to 1, but I haven't been able to find it. I've gotten to the BaseDecisionTree class, which seems to be used by random forests, but I haven't been able to see any normalization going on in there. I created a gist to show how it works. When the row-wise sums of the targets don't sum to 1, the predictions of the regressor don't either. But when the row-wise sums of the targets DO sum to 1, the predictions sum to 1 as well. Here is the demonstration code from the gist:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# simulate data
# 12 rows train, 6 rows test, 5 features, 3 columns for target
features = np.random.random((12, 5))
targets = np.random.random((12, 3))
test_features = np.random.random((6, 5))
rfr = RandomForestRegressor(random_state=42)
rfr.fit(features, targets)
preds = rfr.predict(features)
print('preds sum to 1?')
print(np.allclose(preds.sum(axis=1), np.ones(12)))
# normalize targets to sum to 1
norm_targets = targets / targets.sum(axis=1, keepdims=True)
rfr.fit(features, norm_targets)
preds = rfr.predict(features)
te_preds = rfr.predict(test_features)
print('predictions all sum to 1?')
print(np.allclose(preds.sum(axis=1), np.ones(12)))
print('test predictions all sum to 1?')
print(np.allclose(te_preds.sum(axis=1), np.ones(6)))
As one last note, I tried running a comparable fit in other random forest implementations (H2O in Python, in R: rpart, Rborist, RandomForest) but didn't find another implementation that allows multiple outputs.
My guess is that there is a bug in the sklearn code which is mixing up classification and regression somehow, and the outputs are being normalized to 1 like a classification problem.
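One more check I ran (not in the gist, just a hypothetical follow-up): if a normalize-to-1 step were hardcoded somewhere, then training targets that all sum to some other constant, say 2, should presumably get forced to 1 as well. They don't:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = rng.random((12, 5))
targets = rng.random((12, 3))
# scale each row so it sums to 2 instead of 1
targets = 2 * targets / targets.sum(axis=1, keepdims=True)

rfr = RandomForestRegressor(random_state=42)
rfr.fit(features, targets)
preds = rfr.predict(features)

# predictions sum to 2, not 1 -- each prediction is an average of
# training target rows, and averaging rows that all sum to c
# yields a row that also sums to c
print(np.allclose(preds.sum(axis=1), 2 * np.ones(12)))
```

So whatever is happening, it preserves the row sum rather than normalizing to 1 specifically.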
Upvotes: 1
Views: 648
Reputation: 88236
What can be misleading here is that you are only looking at the resulting sum of the output values. The reason all predictions add up to 1 when the model is trained on the normalized labels is that it predicts only among the multi-output arrays it has seen during training. And this happens because, with so few samples, the model is overfitting, and the decision tree is de facto acting like a classifier.
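A quick sketch of that memorization effect: with default settings a tree grows until its leaves are pure, so on a tiny dataset with continuous random features, every training sample ends up in its own leaf and the training targets are reproduced exactly:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
features = rng.random((6, 5))
targets = rng.random((6, 3))

# an unrestricted tree splits until leaves are pure, i.e. it
# memorizes this tiny training set
tree = DecisionTreeRegressor(random_state=42)
tree.fit(features, targets)

# training predictions reproduce the targets exactly
print(np.allclose(tree.predict(features), targets))
```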
In other words, looking at an example where the output is not normalised (the same applies to a DecisionTreeRegressor):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
features = np.random.random((6, 5))
targets = np.random.random((6, 3))
rfr = DecisionTreeRegressor(random_state=42)
rfr.fit(features, targets)
If we now predict on a new set of random features, we will be getting predictions among the set of outputs the model has been trained on:
features2 = np.random.random((6, 5))
preds = rfr.predict(features2)
print(preds)
array([[0.0017143 , 0.05348525, 0.60877828], #0
[0.05232433, 0.37249988, 0.27844562], #1
[0.08177551, 0.39454957, 0.28182183],
[0.05232433, 0.37249988, 0.27844562],
[0.08177551, 0.39454957, 0.28182183],
[0.80068346, 0.577799 , 0.66706668]])
print(targets)
array([[0.80068346, 0.577799 , 0.66706668],
[0.0017143 , 0.05348525, 0.60877828], #0
[0.08177551, 0.39454957, 0.28182183],
[0.75093787, 0.29467892, 0.11253746],
[0.87035059, 0.32162589, 0.57288903],
[0.05232433, 0.37249988, 0.27844562]]) #1
So logically, if all training outputs add up to 1, the same will apply to the predicted values.
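This holds even without total memorization. A tree (or forest) prediction is a weighted average of training target rows, and any average of rows that each sum to 1 also sums to 1. A minimal arithmetic sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
rows = rng.random((4, 3))
rows /= rows.sum(axis=1, keepdims=True)   # each row sums to 1

# any convex combination (weights are non-negative and sum to 1),
# as produced by averaging leaf samples and then averaging trees
weights = np.array([0.5, 0.25, 0.15, 0.1])
avg = weights @ rows
print(avg.sum())  # 1.0 up to floating point
```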
If we take the unique sums along the first axis for both the targets and the predicted values, we can check that every predicted value's sum exists among the target sums:
preds_sum = np.unique(preds.sum(1))
targets_sum = np.unique(targets.sum(1))
np.isin(preds_sum, targets_sum).all()
# True
Upvotes: 2