Spark ML random forest and gradient-boosted trees for regression

Question

According to the Spark ML docs, random forests and gradient-boosted trees can be used for both classification and regression problems:

https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-regression

Suppose my "label" takes integer values in 0..n and I want to train these models on a regression problem, predicting a continuous value for the label field. However, the documentation does not show how either model should be configured for this, and I don't see any class parameters that distinguish regression from classification. How should both models be configured for regression problems, then?

desertnaut · Accepted Answer

No such configuration is involved, because regression and classification are handled by different submodules and classes in Spark ML. For classification, you should use (assuming PySpark):

from pyspark.ml.classification import GBTClassifier  # GBT
from pyspark.ml.classification import RandomForestClassifier  # RF

while for regression you should use, respectively:

from pyspark.ml.regression import GBTRegressor  # GBT
from pyspark.ml.regression import RandomForestRegressor  # RF

Check the Classification and regression overview in the docs for more details.
