Reputation: 1
What's the difference between: DecisionTreeRegressor(splitter='random') and DecisionTreeRegressor(splitter='best')
If both seem to produce random predictions, I don't understand why both implementations use the random_state parameter.
Here's an example:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/vehicles_train.csv'
train = pd.read_csv(url)
train['vtype'] = train.vtype.map({'car':0, 'truck':1})
feature_cols = ['year', 'miles', 'doors', 'vtype']
X = train[feature_cols]
y = train.price
treereg = DecisionTreeRegressor(splitter='best')
for i in range(1, 10):
    treereg.fit(X, y)
    # predict expects a 2D array: one row per sample
    print(treereg.predict([[1994, 10000, 2, 1]]))
thanks!
Upvotes: 0
Views: 2606
Reputation: 86320
I can't answer this definitively, but this is what I suspect is happening:
Even for splitter="best", the algorithm used inside the decision tree explores the features in a random order (as you can see in the source). If max_features is not set, it explores all features and should therefore find the same best split regardless of the random state, as long as there is a unique best split.
My suspicion is that for the data you provided, at some point there are two possible splits that are equally good according to the specified criterion, so the algorithm chooses whichever one it encounters first; which one that is depends on the random feature order, and hence on random_state.
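A minimal sketch illustrating the point above (using synthetic data rather than the vehicles CSV): whichever splitter you use, fixing random_state makes repeated fits produce identical predictions, because the random feature-exploration order is then pinned down.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: any dataset works for this demo)
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.rand(100)

for splitter in ('best', 'random'):
    # Fit the same model five times with the same random_state
    preds = [
        DecisionTreeRegressor(splitter=splitter, random_state=42)
        .fit(X, y)
        .predict(X[:5])
        for _ in range(5)
    ]
    # With a fixed random_state, every fit yields identical predictions
    assert all(np.array_equal(preds[0], p) for p in preds)
    print(splitter, '-> deterministic with fixed random_state')
```

Without random_state, splitter='random' will generally vary between fits, and splitter='best' can too whenever several splits tie on the criterion.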
Upvotes: 2