Reputation: 1291
I have a binary classification problem on a dataset with mixed data types:
department category
promoted category
review float64
projects int64
salary category
tenure float64
satisfaction float64
bonus category
avg_hrs_month float64
left int32
dtype: object
I have tried to run a RandomForestClassifier() with a grid search and am getting different results depending on whether I use ColumnTransformer() and Pipeline().
With my code without a pipeline, I get a best score of 0.925.
# encoding ordinal and categorical variables
oe = OrdinalEncoder()
df[["salary","left"]] = oe.fit_transform(df[["salary", "left"]])
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[["department"]]).toarray()
# one-hot columns, with the "department_" prefix stripped
new_cols = [s.replace("department_", "") for s in ohe.get_feature_names_out(["department"])]
df[new_cols] = encoded
df = df.drop(columns=["department"])
df
# modelling
cross_val_df, test_df = train_test_split(over_df, train_size=0.80, test_size=0.20, random_state=32)
rfc = RandomForestClassifier(random_state=32)
parameter_grid = {
    'max_depth': [30, 40, 50, 70],
    'n_estimators': [80, 100, 120, 130, 150]
}
grid_search = GridSearchCV(
    estimator=rfc, cv=6, param_grid=parameter_grid, scoring='f1'
).fit(cross_val_df.drop("left", axis=1), cross_val_df["left"])
grid_search.best_params_
{'max_depth': 40, 'n_estimators': 150}
grid_search.best_score_
0.9252745706259379
However, with my code using a pipeline, I am getting only 0.599.
# encoding ordinal and categorical features
ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()
cat_features = ["department"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
# column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ord_features),
        ("cat", categorical_transformer, cat_features),
    ]
)
# pipeline
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier())]
)
cross_val_df, test_df = train_test_split(df_over, train_size=0.80, test_size=0.20, random_state=32)
param_grid = {
    'classifier__max_depth': [30, 40, 50, 70],
    'classifier__n_estimators': [80, 100, 120, 130, 150]
}
grid_search = GridSearchCV(clf, param_grid, scoring='f1', cv=6)
grid_search.fit(cross_val_df.drop("left", axis=1), cross_val_df["left"])
grid_search.best_params_
{'classifier__max_depth': 70, 'classifier__n_estimators': 80}
grid_search.best_score_
0.5990146701866147
clf.set_params(**grid_search.best_params_)
clf.fit(cross_val_df.drop("left", axis=1), cross_val_df["left"])
clf.score(test_df.drop("left", axis=1), test_df["left"])
0.514613392526822
I am new to pipelines and am trying to understand why my results are different. Thank you.
Upvotes: 0
Views: 275
Reputation: 4539
There are three things that I see.
OrdinalEncoder
and the OneHotEncoder
on the full dataset. This is something you should avoid doing, you should always fit it using the training set. Otherwise, your evaluation is not fair as there might be features present in the overall data that are not part of your training set. This means you are leaking information.Upvotes: 1