reddy

Reputation: 190

parallel generation of random forests using scikit-learn

Main question: How do I combine different random forests in Python using scikit-learn?

I am currently using the randomForest package in R to generate random forest objects via Elastic MapReduce, in order to address a classification problem.

Since my input data is too large to fit in memory on one machine, I sample it into smaller datasets and generate a random forest object for each, containing a smaller set of trees. I then combine the trees from the different forests using a modified combine function to create a new random forest object. This object contains the feature importances and the final set of trees, but not the OOB errors or the votes of the trees.

While this works well in R, I want to do the same thing in Python using scikit-learn. I can create the different random forest objects, but I don't have a way to combine them into a new object. Can anyone point me to a function that can combine forests? Is this possible using scikit-learn?

Here is a link to a question about this process in R: Combining random forests built with different training sets in R.

Edit: The resulting random forest object should contain the trees, which can be used for prediction, as well as the feature importances.

Any help would be appreciated.

Upvotes: 9

Views: 7379

Answers (2)

ogrisel

Reputation: 40169

Sure, just aggregate all the trees. For instance, have a look at this snippet from pyrallel:

from copy import copy

def combine(all_ensembles):
    """Combine the sub-estimators of a group of ensembles

        >>> from sklearn.datasets import load_iris
        >>> from sklearn.ensemble import ExtraTreesClassifier
        >>> iris = load_iris()
        >>> X, y = iris.data, iris.target

        >>> all_ensembles = [ExtraTreesClassifier(n_estimators=4).fit(X, y)
        ...                  for i in range(3)]
        >>> big = combine(all_ensembles)
        >>> len(big.estimators_)
        12
        >>> big.n_estimators
        12
        >>> big.score(X, y)
        1.0

    """
    final_ensemble = copy(all_ensembles[0])
    final_ensemble.estimators_ = []

    for ensemble in all_ensembles:
        final_ensemble.estimators_ += ensemble.estimators_

    # Required in old versions of sklearn
    final_ensemble.n_estimators = len(final_ensemble.estimators_)

    return final_ensemble
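To tie this back to the question title, here is a minimal sketch of fitting several small forests in parallel with joblib (a scikit-learn dependency) and merging them with combine above. The fit_forest helper, the seeds, and the forest sizes are illustrative values, not part of pyrallel, and load_iris(return_X_y=True) assumes a reasonably recent scikit-learn:

from joblib import Parallel, delayed
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def fit_forest(X, y, seed):
    # Each worker fits a small forest; in practice X, y would be the
    # subsample that fits in memory on one machine.
    return RandomForestClassifier(n_estimators=10, random_state=seed).fit(X, y)

X, y = load_iris(return_X_y=True)
forests = Parallel(n_jobs=2)(
    delayed(fit_forest)(X, y, seed) for seed in range(4))
big = combine(forests)

print(len(big.estimators_))      # 40 trees in the merged forest
print(big.feature_importances_)  # averaged over all 40 trees

Since feature_importances_ is computed on the fly from estimators_, the merged forest reports importances averaged over all of the combined trees, which covers the feature importance requirement from the question.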

Upvotes: 8

David

Reputation: 9405

Based on your edit, it sounds like you're only asking how to extract the feature importances and inspect the individual trees of a random forest. If so, both are attributes of your random forest model, named "feature_importances_" and "estimators_" respectively. An example illustrating this can be found below:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
>>> clf = RandomForestClassifier(n_estimators=5, max_depth=None, min_samples_split=1, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=1,
            n_estimators=5, n_jobs=1, oob_score=False, random_state=0,
            verbose=0)
>>> clf.feature_importances_
array([ 0.09396245,  0.07052027,  0.09951226,  0.09095071,  0.08926362,
        0.112209  ,  0.09137607,  0.11771107,  0.11297425,  0.1215203 ])
>>> clf.estimators_
[DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b408>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b3f0>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b420>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b438>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b450>,
            splitter='best')]
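
The trees in estimators_ can also be used for prediction directly. Here is a minimal sketch, continuing from the clf fitted above, that averages the per-tree class probabilities by hand, which is essentially what RandomForestClassifier.predict_proba does internally:

>>> import numpy as np
>>> # Average the class probabilities of the individual trees.
>>> probs = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
>>> manual_pred = clf.classes_[np.argmax(probs, axis=1)]

manual_pred should agree with clf.predict(X), up to floating-point ties in the averaged probabilities.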

Upvotes: 2
