Reputation: 75
Question: Could you help me understand why RandomForestClassifier and XGBClassifier have the exact same score?
Context: I'm working on the Kaggle Titanic problem and, on my first attempt, I want to compare some common models.
Code:
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
preprocessor = make_column_transformer(
    (pipeline, ['Embarked']),
    (OneHotEncoder(), ['Sex']),
    # (OrdinalEncoder(), ['Cabin'])
)
models = [
    RandomForestClassifier(n_estimators=1, random_state=42),
    XGBClassifier(random_state=42, n_estimators=100, max_depth=42),
    SGDClassifier(),
]
my_pipelines = []
for model in models:
    my_pipelines.append(Pipeline(steps=[('preprocessor', preprocessor),
                                        ('model', model)]))
for idx, pipeline in enumerate(my_pipelines):
    pipeline.fit(X_train, y_train)
    pred = pipeline.predict(X_valid)
    print(accuracy_score(y_valid, pred))
Output:
0.770949720670391
0.770949720670391
0.6312849162011173
Thank you very much for your help!
Upvotes: 0
Views: 252
Reputation: 1003
It is true that both algorithms are tree-based. However, you can see that you have a single tree in the RandomForestClassifier (n_estimators=1), so you are effectively fitting a DecisionTreeClassifier, while the gradient-boosting model uses a full ensemble. One would normally expect different results.
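To make this concrete, here is a minimal sketch on a synthetic dataset (not the Titanic data) showing that a one-tree forest wraps exactly one fitted DecisionTreeClassifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
forest = RandomForestClassifier(n_estimators=1, random_state=42).fit(X, y)

# estimators_ holds the fitted trees; with n_estimators=1 there is exactly one
print(len(forest.estimators_))               # 1
print(type(forest.estimators_[0]).__name__)  # DecisionTreeClassifier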
Thus, the only thing that makes the performances equal is your data: you have only 2 features, and both are categorical. With such data you cannot learn a complex model, so all the trees should end up identical. You can check the number of nodes in the tree, e.g. my_pipelines[0][-1].estimators_[0].tree_.node_count (I get only 11).
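As a runnable version of that check, assuming the my_pipelines from your question have already been fitted:

# Random forest side: inspect the single underlying tree
rf = my_pipelines[0][-1]                   # the 'model' step of the first pipeline
print(rf.estimators_[0].tree_.node_count)  # a small number (11 in my run)

# XGBoost side: dump the first boosted tree as text to eyeball its splits
xgb = my_pipelines[1][-1]
print(xgb.get_booster().get_dump()[0])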
Add 2 additional numerical features (e.g. Fare and Age) and you will see that the trees can find additional split rules, and the performance will then differ.
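Here is a minimal sketch of that extension, assuming the standard Titanic column names Age and Fare (both contain missing values, hence the imputer):

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
preprocessor = make_column_transformer(
    (cat_pipeline, ['Embarked']),
    (OneHotEncoder(), ['Sex']),
    # new: numerical columns, median-imputed, passed through to the models
    (SimpleImputer(strategy='median'), ['Age', 'Fare']),
)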
Upvotes: 1