Reputation: 2391
I'm trying to implement R's feature importance score method for random forest regression models in sklearn; according to R's documentation:
The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
Therefore, if I understand correctly, I need to be able to permute each predictor variable (feature) for OOB samples within each tree.
I understand that I can access each tree within a trained forest with something like this

from sklearn.ensemble import RandomForestRegressor

numberTrees = 100
clf = RandomForestRegressor(n_estimators=numberTrees)
clf.fit(X, Y)
for tree in clf.estimators_:
    pass  # do something with each fitted DecisionTreeRegressor
Is there any way of getting a list of samples that are OOB for each tree? Perhaps I can use the random_state of each tree to derive the list of OOB samples?
Upvotes: 2
Views: 624
Reputation: 2391
Although R uses OOB samples, I've found that by using all the training samples, I get similar results in scikit-learn. I'm doing the following:
from collections import defaultdict
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# permute training data and score against its own model
epochs = 3
seeds = range(epochs)
scores = defaultdict(list)  # {feature: relative drop in R^2}

# repeat the process several times, then average the score for each feature
# (trees, num_features and leaf are hyperparameters set elsewhere)
for j in range(epochs):
    clf = RandomForestRegressor(n_jobs=-1, n_estimators=trees, random_state=seeds[j],
                                max_features=num_features, min_samples_leaf=leaf)
    clf = clf.fit(X_train, y_train)
    acc = clf.score(X_train, y_train)
    print('Epoch', j)
    # for each feature, permute its values and check the resulting score
    for i, col in enumerate(X_train.columns):
        if i % 200 == 0:
            print("- feature %s of %s permuted" % (i, X_train.shape[1]))
        X_train_copy = X_train.copy()
        X_train_copy[col] = np.random.permutation(X_train[col])
        shuff_acc = clf.score(X_train_copy, y_train)
        scores[col].append((acc - shuff_acc) / acc)

# get mean across epochs
scores_mean = {k: np.mean(v) for k, v in scores.items()}
# sort scores (best first)
scores_sorted = pd.DataFrame.from_dict(scores_mean, orient='index').sort_values(0, ascending=False)
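As an aside, newer scikit-learn versions (0.22+) ship this same shuffle-and-rescore procedure as sklearn.inspection.permutation_importance, where n_repeats plays the role of the manual epoch loop above (a sketch on synthetic data, not the asker's dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# toy regression problem standing in for X_train / y_train
X, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)
clf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# shuffle each feature n_repeats times and average the score drop
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # best feature first
```

result.importances_mean and result.importances_std correspond to the averaged (and normalizable) per-feature score drops that the R documentation describes.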
Upvotes: 3