Keira Marquez

Reputation: 1

What's the random_state in DecisionTreeRegressor?

What's the difference between DecisionTreeRegressor(splitter='random') and DecisionTreeRegressor(splitter='best')?

Both seem to produce predictions that change from run to run, so I don't understand why both take a random_state parameter.

Here's an example:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_train.csv'
train = pd.read_csv(url)

train['vtype'] = train.vtype.map({'car':0, 'truck':1})
feature_cols = ['year', 'miles', 'doors', 'vtype']
X = train[feature_cols]
y = train.price

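# No random_state is set, so each fit may break ties differently
treereg = DecisionTreeRegressor(splitter='best')

for i in range(1, 10):
    treereg.fit(X, y)
    # predict expects a 2D input: one row per sample
    print(treereg.predict([[1994, 10000, 2, 1]]))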

thanks!

Upvotes: 0

Views: 2606

Answers (1)

jakevdp

Reputation: 86320

I can't answer this definitively, but this is what I suspect is happening:

Even for splitter="best", the algorithm used inside the decision tree explores the features in a random order (as you can see in the source). If max_features is left at its default (None), it explores all features and should therefore find the same best split regardless of the random state, as long as there is a unique best split.

My suspicion is that for the data you provided, at some point there are two possible splits that are equally good according to the specified criterion, and so the algorithm chooses the one it sees first, which depends on the random feature order.
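To make that suspicion concrete, here is a toy sketch in plain Python. It is not scikit-learn's actual splitter: the function best_split_feature and the gain values are made up for illustration. It only shows how visiting features in a seeded random order, and keeping a candidate only when it is strictly better, lets a tie between two features be resolved differently from one seed to the next.

```python
import random


def best_split_feature(gains, seed):
    """Pick the best feature when features are visited in a seeded random order.

    gains maps a feature name to a (hypothetical) split quality, higher is better.
    A candidate replaces the current best only if it is strictly better,
    so among tied features the first one visited wins.
    """
    rng = random.Random(seed)
    order = list(gains)
    rng.shuffle(order)  # the visiting order depends on the seed

    best_name, best_gain = None, float("-inf")
    for name in order:
        if gains[name] > best_gain:
            best_name, best_gain = name, gains[name]
    return best_name


# Made-up gains: 'year' and 'miles' tie for the best split.
tied = {"year": 0.8, "miles": 0.8, "doors": 0.1, "vtype": 0.3}
# Made-up gains with a unique best split on 'year'.
unique = {"year": 0.9, "miles": 0.8, "doors": 0.1, "vtype": 0.3}

# Collect the winners over many seeds.
tied_picks = {best_split_feature(tied, seed) for seed in range(50)}
unique_picks = {best_split_feature(unique, seed) for seed in range(50)}

print("tied gains, winners seen:", tied_picks)
print("unique best, winners seen:", unique_picks)
```

With a unique best split the winner is the same for every seed; with a tie, the winner can only be one of the tied features, and which one depends on the seed. If this is what is happening with your data, passing a fixed integer, for example DecisionTreeRegressor(random_state=0), should make repeated fits give the same prediction.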

Upvotes: 2
