Reputation: 1291
I have a binary classification problem on a dataset with mixed data types:
department category
promoted category
review float64
projects int64
salary category
tenure float64
satisfaction float64
bonus category
avg_hrs_month float64
left int32
dtype: object
I have tried to run a RandomForestClassifier() with a grid search and am getting different results depending on whether I use ColumnTransformer() and Pipeline().
With my code without a pipeline, I get a best score of 0.925.
# encoding ordinal and categorical variables
oe = OrdinalEncoder()
df[["salary","left"]] = oe.fit_transform(df[["salary", "left"]])
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[["department"]]).toarray()
# one-hot columns, with the "department_" prefix stripped
new_cols = [s.replace("department_", "") for s in ohe.get_feature_names_out(["department"])]
df[new_cols] = encoded
df = df.drop(columns=["department"])
df
# modelling
cross_val_df, test_df = train_test_split(over_df, train_size=0.80, test_size=0.20, random_state=32)
rfc = RandomForestClassifier(random_state=32)
parameter_grid = {
    'max_depth': [30, 40, 50, 70],
    'n_estimators': [80, 100, 120, 130, 150]
}
grid_search = GridSearchCV(
    estimator=rfc, cv=6, param_grid=parameter_grid, scoring='f1'
).fit(cross_val_df.drop("left", axis=1), cross_val_df["left"])
grid_search.best_params_
{'max_depth': 40, 'n_estimators': 150}
grid_search.best_score_
0.9252745706259379
However, with my code using a pipeline, I am getting only 0.599.
# encoding ordinal and categorical features
ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()
cat_features = ["department"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
# column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ord_features),
        ("cat", categorical_transformer, cat_features),
    ]
)
# pipeline
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier())]
)
cross_val_df, test_df = train_test_split(df_over, train_size=0.80, test_size=0.20, random_state=32)
param_grid = {
    'classifier__max_depth': [30, 40, 50, 70],
    'classifier__n_estimators': [80, 100, 120, 130, 150]
}
grid_search = GridSearchCV(clf, param_grid, scoring='f1', cv=6)
grid_search.fit(cross_val_df.drop("left", axis=1), cross_val_df["left"])
grid_search.best_params_
{'classifier__max_depth': 70, 'classifier__n_estimators': 80}
grid_search.best_score_
0.5990146701866147
clf.set_params(**grid_search.best_params_)
clf.fit(cross_val_df.drop("left", axis=1), cross_val_df["left"])
clf.score(test_df.drop("left", axis=1), test_df["left"])
0.514613392526822
I am new to pipelines and am trying to understand why my results are different. Thank you.
Upvotes: 0
Views: 275
Reputation: 4539
There are three things that I see.
OrdinalEncoder
and the OneHotEncoder
on the full dataset. This is something you should avoid doing, you should always fit it using the training set. Otherwise, your evaluation is not fair as there might be features present in the overall data that are not part of your training set. This means you are leaking information.Upvotes: 1