Patthebug

Reputation: 4787

Make RandomForestClassifier pick a variable for sure during training

This is a bit of a newbie question.

I'd like to train a Random Forest using a RandomForestClassifier from sklearn. I have a few variables, and out of these I'd like the algorithm to be guaranteed to pick one particular variable (let's call it SourceID) in every single tree that it trains.

How do I do that? I don't see any parameters in the classifier that would help in this case.

Any help would be appreciated! TIA.

EDIT

So here's the scenario I have:

If a teacher assigns an assignment on Concept A, I have to predict the next possible assignment concept. The next assigned concept is heavily dependent on Concept A, which has already been assigned. For example, after assigning "Newton's first law of motion", there's a good chance that "Newton's second law of motion" will be assigned next. Quite often, the choice of concepts to be assigned after, say, Concept A is limited. I'd like to predict the best possible option after Concept A has been assigned, given past data.

If I let the random forest do its job of picking variables at random, then there will be a few trees that do not have the variable for Concept A, in which case the prediction may not make much sense, which is why I'd like to force this variable into selection. Better yet, it'd be great if this variable were chosen as the first variable to split on in each tree.

Does this make things clear? Is random forest not a candidate at all for this job?
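For concreteness, here is a minimal sketch of the kind of data layout I have in mind (the column names and values are made up for illustration): the previously assigned concept is a single categorical feature, SourceID, sitting alongside the other predictors, and the target is the next assigned concept.

import pandas as pd

# Hypothetical example rows; SourceID is the previously assigned concept
# and next_concept is what the forest should predict.
df = pd.DataFrame({
    "SourceID":     ["newtons_first_law", "newtons_first_law", "ohms_law"],
    "grade_level":  [9, 10, 11],
    "next_concept": ["newtons_second_law", "newtons_second_law", "kirchhoffs_laws"],
})

# Encode the categorical concept as integers so sklearn trees can split on it.
X = pd.DataFrame({
    "SourceID":    pd.factorize(df["SourceID"])[0],
    "grade_level": df["grade_level"],
})
y = df["next_concept"]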

Upvotes: 3

Views: 1796

Answers (2)

Matt Hancock

Reputation: 4039

There is no option for this in the RandomForestClassifier, but the random forest algorithm is just an ensemble of decision trees where each tree only considers a subset of all possible features and is trained on a bootstrap subsample of the training data.

So it isn't too difficult to build this ourselves manually, forcing each tree to use a specific set of features. I've written a class to do this below. It does not perform robust input validation or anything like that, but you can consult the source of sklearn's random forest fit function for that. It is meant to give you a flavor of how to build this yourself:

FixedFeatureRFC.py

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class FixedFeatureRFC:
    def __init__(self, n_estimators=10, random_state=None):
        self.n_estimators = n_estimators

        # Accept either a seed or an existing numpy RandomState instance
        # (np.random.RandomState(None) seeds itself from OS entropy).
        if isinstance(random_state, np.random.RandomState):
            self.random_state = random_state
        else:
            self.random_state = np.random.RandomState(random_state)

    def fit(self, X, y, feats_fixed=None, max_features=None, bootstrap_frac=0.8):
        """
        feats_fixed: indices of features (columns of X) to be 
                     always used to train each estimator

        max_features: number of features that each estimator will use,
                      including the fixed features.

        bootstrap_frac: size of bootstrap sample that each estimator will use.
        """
        self.estimators = []
        self.feats_used = []
        self.n_classes  = np.unique(y).shape[0]

        if feats_fixed is None:
            feats_fixed = []
        if max_features is None:
            max_features = X.shape[1]

        n_samples = X.shape[0]
        n_bs = int(bootstrap_frac*n_samples)

        feats_fixed = list(feats_fixed)
        feats_all   = range(X.shape[1])

        random_choice_size = max_features - len(feats_fixed)

        feats_choosable = set(feats_all).difference(set(feats_fixed))
        feats_choosable = np.array(list(feats_choosable))

        for i in range(self.n_estimators):
            chosen = self.random_state.choice(feats_choosable,
                                              size=random_choice_size,
                                              replace=False)
            feats = feats_fixed + list(chosen)
            self.feats_used.append(feats)

            bs_sample = self.random_state.choice(n_samples,
                                                 size=n_bs,
                                                 replace=True)

            dtc = DecisionTreeClassifier(random_state=self.random_state)
            dtc.fit(X[bs_sample][:,feats], y[bs_sample])
            self.estimators.append(dtc)

    def predict_proba(self, X):
        # Average class probabilities across the trees; this assumes every
        # bootstrap sample contained all classes so the columns line up.
        out = np.zeros((X.shape[0], self.n_classes))
        for i in range(self.n_estimators):
            out += self.estimators[i].predict_proba(X[:,self.feats_used[i]])
        return out / self.n_estimators

    def predict(self, X):
        return self.predict_proba(X).argmax(axis=1)

    def score(self, X, y):
        return (self.predict(X) == y).mean() 

Here is a test script to see if the class above works as intended:

test.py

import numpy as np
from sklearn.datasets import load_breast_cancer
from FixedFeatureRFC import FixedFeatureRFC

rs = np.random.RandomState(1234)
BC = load_breast_cancer()
X,y = BC.data, BC.target
train = rs.rand(X.shape[0]) < 0.8

print "n_features =", X.shape[1]

fixed = [0,4,21]
maxf  = 10

ffrfc = FixedFeatureRFC(n_estimators=1000)
ffrfc.fit(X[train], y[train], feats_fixed=fixed, max_features=maxf)

for feats in ffrfc.feats_used:
    assert len(feats) == maxf
    for f in fixed:
        assert f in feats

print(ffrfc.score(X[~train], y[~train]))

The output is:

n_features = 30
0.983739837398

None of the assertions failed, indicating that the features we chose to be fixed were used in every random feature subsample and that each feature subsample had the required max_features size. The high accuracy on the held-out data suggests that the classifier is working properly.

Upvotes: 5

Craig

Reputation: 346

I do not believe there is a way to do this in scikit-learn at the moment. You could use max_features=None, which removes all randomness from the feature selection.
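As a minimal sketch of that (standard sklearn API, with placeholder training data), disabling feature subsampling means every split in every tree considers all features, so SourceID is never left out; the bootstrap sampling of rows still provides some randomness between trees:

from sklearn.ensemble import RandomForestClassifier

# max_features=None: every split considers all features, so SourceID
# (or any other column) is always available to every tree.
rf = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)
# rf.fit(X_train, y_train)   # X_train, y_train: your training data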

If you can switch packages, R's ranger (https://cran.r-project.org/web/packages/ranger/ranger.pdf) has the options split.select.weights and always.split.variables, which may be what you are looking for: you can define the probabilities for the random feature choices, or always include certain features in addition to the random choices.

This works against the overall design of random forests: reducing the randomness may in turn weaken the variance reduction of the algorithm. You should know a lot about your data and the problem before choosing this option. As @Michal alluded to, proceed carefully here.

Upvotes: 1
