Bluetail

Reputation: 1291

Different score and best parameter values after using pipeline

I have a binomial classification problem on the dataset with mixed data,

department       category
promoted         category
review            float64
projects            int64
salary           category
tenure            float64
satisfaction      float64
bonus            category
avg_hrs_month     float64
left                int32
dtype: object

I have tried to run a RandomForestClassifier() with GridSearchCV and am getting different results depending on whether I use ColumnTransformer() and Pipeline().

With my code without a pipeline, I get a best score of 0.925.

# encoding ordinal variables in place
oe = OrdinalEncoder()
df[["salary", "left"]] = oe.fit_transform(df[["salary", "left"]])

# one-hot encoding the department column; the "department_" prefix is
# stripped from the generated column names before they are added to df
ohe = OneHotEncoder()
dummies = ohe.fit_transform(df[["department"]]).toarray()
names = [s.replace("department_", "") for s in ohe.get_feature_names_out(["department"])]
df[names] = dummies

# drop the original, now redundant, department column
df = df.drop(columns=["department"])
df

# modelling
cross_val_df, test_df = train_test_split(over_df, train_size=0.80, test_size=0.20, random_state=32) 
rfc = RandomForestClassifier(random_state=32)

parameter_grid = {
  'max_depth': [30, 40, 50, 70],
  'n_estimators': [80, 100, 120, 130, 150]
}

grid_search = GridSearchCV(estimator=rfc, cv=6, param_grid=parameter_grid, scoring='f1').fit(cross_val_df.drop("left", axis=1), cross_val_df["left"])

grid_search.best_params_
{'max_depth': 40, 'n_estimators': 150}
grid_search.best_score_
0.9252745706259379

However, with my code using a pipeline, I get a best score of only 0.599.


# encoding ordinal and categorical features
ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()


cat_features = ["department"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ord_features),
        ("cat", categorical_transformer, cat_features ),
           ]
)

# pipeline
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier())]
)

cross_val_df, test_df = train_test_split(over_df, train_size=0.80, test_size=0.20, random_state=32)

param_grid = {
  'classifier__max_depth': [30, 40, 50, 70],
  'classifier__n_estimators': [80, 100, 120, 130, 150]
}


grid_search = GridSearchCV(clf, param_grid, scoring='f1', cv=6)

grid_search.fit(cross_val_df.drop("left", axis=1), cross_val_df["left"])

grid_search.best_params_
{'classifier__max_depth': 70, 'classifier__n_estimators': 80}

grid_search.best_score_
0.5990146701866147

clf.set_params(**grid_search.best_params_)


clf.fit(cross_val_df.drop("left", axis=1), cross_val_df["left"])

clf.score(test_df.drop("left", axis=1), test_df["left"])
0.514613392526822

I am new to pipelines and am trying to understand why my results are different. Thank you.

Upvotes: 0

Views: 275

Answers (1)

Simon Hawe

Reputation: 4539

There are three things that I see.

  1. In your first example, you are fitting the OrdinalEncoder and the OneHotEncoder on the full dataset. You should avoid this and always fit them on the training set only. Otherwise your evaluation is not fair, because category values present in the overall data may not appear in your training set. This means you are leaking information.
  2. I am not sure about this one, but it looks as if you are passing over_df to train_test_split rather than the transformed data frame df. If those are the same, you can ignore this point, but it is not obvious from your code.
  3. In the pipeline, only the features department and salary are used, because the ColumnTransformer drops all other columns by default. In your first example, however, it looks as if you are using all of your columns. Again, this is just what I expect to happen; I am not 100% certain from the code you've shared.
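Putting points 1 and 3 together, here is a minimal sketch of how this could look: all encoders live inside the pipeline (so they are re-fitted on each training fold and never see validation data), and `remainder="passthrough"` keeps the numeric columns the ColumnTransformer would otherwise drop. The data frame below is a stand-in following the column names from your question; the hyperparameter values are illustrative, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Stand-in data with the same kinds of columns as in the question.
rng = np.random.default_rng(32)
n = 300
df = pd.DataFrame({
    "department": rng.choice(["IT", "sales", "admin"], n),
    "salary": rng.choice(["low", "medium", "high"], n),
    "review": rng.random(n),
    "projects": rng.integers(1, 6, n),
    "left": rng.integers(0, 2, n),
})

preprocessor = ColumnTransformer(
    transformers=[
        ("ord", OrdinalEncoder(), ["salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["department"]),
    ],
    remainder="passthrough",  # keep review, projects, etc. instead of dropping them
)

clf = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=32)),
])

train_df, test_df = train_test_split(df, train_size=0.80, random_state=32)
X_train, y_train = train_df.drop("left", axis=1), train_df["left"]

# The encoders are re-fitted on each training fold inside GridSearchCV,
# so no information from the validation folds leaks into preprocessing.
grid_search = GridSearchCV(
    clf,
    {"classifier__max_depth": [5, 10], "classifier__n_estimators": [50, 100]},
    scoring="f1",
    cv=3,
).fit(X_train, y_train)
print(grid_search.best_params_)
```

With this setup the held-out test_df is only ever touched once, by the final refitted pipeline, which is the evaluation your non-pipeline version was missing.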

Upvotes: 1
