reddy

Reputation: 190

parallel generation of random forests using scikit-learn

Main question: How do I combine different random forests in Python using scikit-learn?

I am currently using the randomForest package in R to generate random forest objects via Elastic MapReduce, in order to address a classification problem.

Since my input data is too large to fit in memory on one machine, I sample it into smaller datasets and generate a random forest object for each, containing a smaller set of trees. I then combine the trees from the different forests using a modified combine function to create a new random forest object. This object contains the feature importances and the final set of trees, but not the OOB errors or the votes of the trees.

While this works well in R, I want to do the same thing in Python using scikit-learn. I can create the different random forest objects, but I don't have a way to combine them into a new object. Can anyone point me to a function that can combine forests? Is this possible using scikit-learn?

Here is a link to a question about this process in R: Combining random forests built with different training sets in R.

Edit: The resulting random forest object should contain the trees, which can be used for prediction, as well as the feature importances.

Any help would be appreciated.

Upvotes: 9

Views: 7379

Answers (2)

ogrisel

Reputation: 40169

Sure, just aggregate all the trees. For instance, have a look at this snippet from pyrallel:

from copy import copy

def combine(all_ensembles):
    """Combine the sub-estimators of a group of ensembles

        >>> from sklearn.datasets import load_iris
        >>> from sklearn.ensemble import ExtraTreesClassifier
        >>> iris = load_iris()
        >>> X, y = iris.data, iris.target

        >>> all_ensembles = [ExtraTreesClassifier(n_estimators=4).fit(X, y)
        ...                  for i in range(3)]
        >>> big = combine(all_ensembles)
        >>> len(big.estimators_)
        12
        >>> big.n_estimators
        12
        >>> big.score(X, y)
        1.0

    """
    final_ensemble = copy(all_ensembles[0])
    final_ensemble.estimators_ = []

    for ensemble in all_ensembles:
        final_ensemble.estimators_ += ensemble.estimators_

    # Required in old versions of sklearn
    final_ensemble.n_estimators = len(final_ensemble.estimators_)

    return final_ensemble
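To tie this back to the question title, here is a minimal sketch of fitting several small forests in parallel with joblib (a scikit-learn dependency) and merging them with combine above. The fit_forest helper, the seeds, and the forest sizes are illustrative values, not part of pyrallel, and load_iris(return_X_y=True) assumes a reasonably recent scikit-learn:

from joblib import Parallel, delayed
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def fit_forest(X, y, seed):
    # Each worker fits a small forest; in practice X, y would be the
    # subsample that fits in memory on one machine.
    return RandomForestClassifier(n_estimators=10, random_state=seed).fit(X, y)

X, y = load_iris(return_X_y=True)
forests = Parallel(n_jobs=2)(
    delayed(fit_forest)(X, y, seed) for seed in range(4))
big = combine(forests)

print(len(big.estimators_))      # 40 trees in the merged forest
print(big.feature_importances_)  # averaged over all 40 trees

Since feature_importances_ is computed on the fly from estimators_, the merged forest reports importances averaged over all of the combined trees, which covers the feature importance requirement from the question.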

Upvotes: 8

David

Reputation: 9405

Based on your edit, it sounds like you're only asking how to extract the feature importances and inspect the individual trees of a random forest. If so, both are attributes of your random forest model, named "feature_importances_" and "estimators_" respectively. An example illustrating this can be found below:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
>>> clf = RandomForestClassifier(n_estimators=5, max_depth=None, min_samples_split=1, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=1,
            n_estimators=5, n_jobs=1, oob_score=False, random_state=0,
            verbose=0)
>>> clf.feature_importances_
array([ 0.09396245,  0.07052027,  0.09951226,  0.09095071,  0.08926362,
        0.112209  ,  0.09137607,  0.11771107,  0.11297425,  0.1215203 ])
>>> clf.estimators_
[DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b408>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b3f0>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b420>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b438>,
            splitter='best'), DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features='auto', min_density=None,
            min_samples_leaf=1, min_samples_split=1,
            random_state=<mtrand.RandomState object at 0x2b6f62d9b450>,
            splitter='best')]
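
The trees in estimators_ can also be used for prediction directly. Here is a minimal sketch, continuing from the clf fitted above, that averages the per-tree class probabilities by hand, which is essentially what RandomForestClassifier.predict_proba does internally:

>>> import numpy as np
>>> # Average the class probabilities of the individual trees.
>>> probs = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
>>> manual_pred = clf.classes_[np.argmax(probs, axis=1)]

manual_pred should agree with clf.predict(X), up to floating-point ties in the averaged probabilities.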

Upvotes: 2
