Reputation: 169
I'm now using RandomForestRegressor from sklearn.ensemble to analyze a dataset and I select "mse" as the function to measure the quality of a split. But I'm not quite clear how the mse is calculated. Could anyone explain it to me here (better with equations) or provide me some references on that? Thank you in advance.
Upvotes: 2
Views: 1839
Reputation: 1966
Short answer:
Mean Squared Error (MSE) is calculated by squaring each error (which makes every term non-negative) and then taking the mean of those squares. It is a single value. In a random forest, "for each tree, we get a difference between two MSE values. Averaging over trees gives the mean difference between the two MSE values". ref
... However, the Random Forest calculates the MSE using predictions on the same training data. For each tree, it only considers the samples that were not drawn by the bootstrap used to construct that tree, that is, the samples that are out-of-bag (OOB) for it. Then it averages the OOB predictions for each sample of data. ... ref
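The out-of-bag idea described above can be tried in scikit-learn. The following is a minimal sketch, assuming a synthetic dataset from make_regression and arbitrary settings (300 samples, 200 trees); it relies on the oob_score=True option and the oob_prediction_ attribute of RandomForestRegressor. Note that this OOB error is an estimate of how well the whole forest generalizes, which is a separate thing from the criterion used to choose splits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data, used here only for illustration
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# oob_score=True makes the forest keep out-of-bag predictions after fitting
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# oob_prediction_[i] averages only the trees whose bootstrap sample
# did not contain training sample i
oob_mse = mean_squared_error(y, forest.oob_prediction_)

# For comparison: MSE when every tree predicts data it may have been trained on
train_mse = mean_squared_error(y, forest.predict(X))

print("OOB MSE:  ", oob_mse)
print("Train MSE:", train_mse)
```

The OOB MSE is normally noticeably larger than the training MSE, because each OOB prediction comes only from trees that never saw that sample.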
MSE is one of the most common cost functions. Suppose your model is the green line in the following picture and the blue points are the data (observations). MSE, as its name suggests, is the mean of the areas of the squares whose sides are the vertical distances between each data point and the line; taken together, they represent your model's error.
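MSE can be calculated by:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ is the observed value, $\hat{y}_i$ is the model's prediction for the $i$-th data point, and $n$ is the number of data points.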
It shows how good or bad the model is: the smaller the MSE, the better the model fits the data.
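As a quick numeric check of the definition (the mean of the squared errors), here is a small sketch with made-up values. The arrays y_true and y_pred are purely illustrative; the hand computation is compared with sklearn.metrics.mean_squared_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observations and model predictions (illustrative values only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred            # [0.5, -0.5, 0.0, -1.0]
squared_errors = errors ** 2        # [0.25, 0.25, 0.0, 1.0]
mse_manual = squared_errors.mean()  # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375

# The library function gives the same number
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # 0.375 0.375
```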
More info:
Understanding Regression Error Metrics in Python
Introduction to Loss Functions
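Update 30.05.2019: To verify things, you can dig into the documentation and sometimes the source code as well. According to the documentation of RandomForestRegressor(), MSE is equal to variance reduction as the feature selection criterion, and the source code confirms that it is used for measuring the quality of a split. Note that criterion only accepts the name of a built-in criterion as a string ("mse" in older scikit-learn releases, "squared_error" since version 1.0), so a value computed with mean_squared_error cannot be passed to it. If you have doubts about the MSE used inside RandomForestRegressor(), you can instead compute the MSE of the fitted model's predictions independently, like this:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# X_train, y_train, X_test, y_test are your own train/test splits
# criterion takes a string: "squared_error" (named "mse" before scikit-learn 1.0)
est = RandomForestRegressor(criterion="squared_error")
est.fit(X_train, y_train)
# MSE of the fitted model on held-out data
mse = mean_squared_error(y_test, est.predict(X_test))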
or using numpy:
import numpy as np
# est is a fitted regressor; X_test, y_test are held-out data
mse = np.mean((y_test - est.predict(X_test)) ** 2)
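To make "variance reduction" as a split criterion more concrete, here is a simplified sketch of how an MSE-based criterion can score candidate splits. The helper names node_mse and mse_reduction and the toy data are hypothetical, written only for illustration. scikit-learn's actual implementation is compiled code that also handles sample weights and other details, so treat this as a conceptual model, not the library's code. A node predicts the mean of its targets, so its MSE is just the variance of those targets; a split is scored by how much the size-weighted MSE of the two children drops below the parent's MSE.

```python
import numpy as np

def node_mse(y):
    """MSE of a node that predicts the mean of its targets (the variance of y)."""
    return np.mean((y - y.mean()) ** 2)

def mse_reduction(y, left_mask):
    """Decrease in MSE when y is split into a left child (mask True) and a right child."""
    y_left, y_right = y[left_mask], y[~left_mask]
    n = len(y)
    weighted_children = (
        (len(y_left) / n) * node_mse(y_left)
        + (len(y_right) / n) * node_mse(y_right)
    )
    return node_mse(y) - weighted_children

# Toy data: one feature x and a target y with an obvious jump between x=2 and x=3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 9.0, 10.0])

# Score each candidate threshold; the tree would keep the largest reduction
for threshold in (1.5, 2.5, 3.5):
    print(threshold, mse_reduction(y, x <= threshold))
```

With these toy values the parent MSE is 16.25, and the split at 2.5 gives the largest reduction (16.0, versus 6.75 for the other two thresholds), so that is the split an MSE criterion would prefer.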
More info
Upvotes: 2