tb08

Reputation: 57

Precision calculation warning when using GridSearchCV for Logistic Regression

I am trying to run GridSearchCV with the LogisticRegression estimator and record the model accuracy, precision, recall, f1 metrics.

However, I get the following error on the precision metric:

Precision is ill-defined and being set to 0.0 due to no predicted samples. 
Use `zero_division` parameter to control this behavior

I understand why I am getting the warning: there are no predictions with an output value equal to 1 in the KFold split. However, I don't understand how I can set "zero_division" to 1 in GridSearchCV (the logistic_reg variable).

Original code

logistic_reg = GridSearchCV(estimator=LogisticRegression(penalty="l1", random_state=42, max_iter=10000), param_grid={
        "C": [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1, 5, 10, 20],
        "solver": ["liblinear", "saga"]
        }, scoring=["accuracy", "precision", "recall", "f1"], cv=StratifiedKFold(n_splits=10), refit="accuracy")
    
logistic_reg_X_train = self.X_train.copy()
logistic_reg_X_train.drop(self.columns_removed, axis=1, inplace=True)
    
logistic_reg.fit(logistic_reg_X_train, self.y_train)
logistic_reg_results = pd.DataFrame(logistic_reg.cv_results_)

I tried changing "precision" to precision_score(zero_division=1) but this gives me another error (missing 2 required positional arguments: 'y_true' and 'y_pred'). Again I understand this but the 2 missing parameters are not defined before applying the fit method.

How can I pass the zero_division parameter to the precision score metric?

Edit

What I don't understand is that I stratified the y data in my train_test_split call and used StratifiedKFold in the GridSearchCV. My understanding is that the train/test data will have the same proportion of y values and that the same should happen during cross-validation. This means that in the GridSearchCV splits, the data should have y values of both 0 and 1, and thus precision should not be ill-defined (the model should be able to calculate TP and FP, since the test fold contains samples where y is equal to 1). I'm not sure where to go from here.
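A minimal sketch of what the warning text ("no predicted samples") refers to, using made-up labels rather than the data from the question: precision is TP / (TP + FP), so it depends on how many samples are predicted as 1, not on how many true 1s the fold contains. Even when y_true holds both classes, a prediction vector with no 1s gives 0/0, and zero_division only decides which value is reported.

```python
from sklearn.metrics import precision_score

# Hypothetical fold: both classes are present in the true labels...
y_true = [0, 1, 0, 1, 0, 0]
# ...but the fitted model never predicts class 1.
y_pred = [0, 0, 0, 0, 0, 0]

# TP + FP == 0, so precision is 0/0; zero_division selects the reported value.
p0 = precision_score(y_true, y_pred, zero_division=0)
p1 = precision_score(y_true, y_pred, zero_division=1)
print(p0, p1)
```

Stratification therefore controls the class balance of y_true in each fold, but it cannot guarantee that a given candidate model predicts any positives.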

Upvotes: 0

Views: 1332

Answers (1)

tb08

Reputation: 57

From reading further into this issue, my understanding is that the error is occurring because not all the labels in my y_test are appearing in my y_pred. This is not the case for my data.
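One possible explanation, sketched on synthetic data (make_classification, not the data from the question, so the behaviour on the real data is an assumption): the grid contains very small C values, and with an L1 penalty strong regularization can shrink every coefficient to zero. Such a model predicts a single class for every sample, so class 1 never appears in y_pred even though it is present in y_test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def count_predicted_positives(C):
    """Fit an L1 logistic regression and return
    (number of samples predicted as 1, number of non-zero coefficients)."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=42)
    model.fit(X, y)
    n_pos = int(np.sum(model.predict(X) == 1))
    n_nonzero = int(np.count_nonzero(model.coef_))
    return n_pos, n_nonzero

small = count_predicted_positives(1e-4)  # smallest C in the grid
large = count_predicted_positives(5)     # a weakly regularized candidate
print("C=1e-4:", small)
print("C=5:   ", large)
```

If the same thing happens on the real data, the warning would come from the heavily regularized candidates in the grid rather than from the splits themselves.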

I used the comment from G.Anderson to remove the warning (but it doesn't answer my question)

  • Created a new precision_scorer object

  • Created a custom_scoring dictionary

  • Updated GridSearchCV scoring and refit parameters

    from sklearn.metrics import precision_score, make_scorer
    
    precision_scorer = make_scorer(precision_score, zero_division=0)
    
    custom_scoring = {"accuracy": "accuracy", "precision": precision_scorer, "recall": "recall", "f1": "f1"}
    
    logistic_reg = GridSearchCV(estimator=LogisticRegression(penalty="l1", random_state=42, max_iter=10000), param_grid={
          "C": [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20],
          "solver": ["liblinear", "saga"]
          }, scoring=custom_scoring, cv=StratifiedKFold(n_splits=10), refit="accuracy")
    

Edit - Answer to Question Above

I used GridSearchCV to find the best hyperparameters for the model. To view the model metrics for each split, I created a StratifiedKFold splitter and a LogisticRegression with the best hyperparameters, then ran the cross-validation manually. This gave me no precision warnings. I have no idea why GridSearchCV is giving me a warning, but at least this way works!

Note: I get the same results from the method below and GridSearchCV in the question above.

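import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10)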
logistic_reg_class_skf = LogisticRegression(penalty="l1", max_iter=10000, random_state=42, C=5, solver="liblinear")
    
logistic_reg_class_score = []
                    
for train, test in skf.split(logistic_reg_class_X_train, self.y_train):
        
    logistic_reg_class_skf_X_train = logistic_reg_class_X_train.iloc[train]
    logistic_reg_class_skf_X_test = logistic_reg_class_X_train.iloc[test]
    logistic_reg_class_skf_y_train = self.y_train.iloc[train]
    logistic_reg_class_skf_y_test = self.y_train.iloc[test]
        
    logistic_reg_class_skf.fit(logistic_reg_class_skf_X_train, logistic_reg_class_skf_y_train)
    logistic_reg_skf_y_pred = logistic_reg_class_skf.predict(logistic_reg_class_skf_X_test)
        
    skf_accuracy_score = metrics.accuracy_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    skf_precision_score = metrics.precision_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    skf_recall_score = metrics.recall_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    skf_f1_score = metrics.f1_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)

    logistic_reg_class_score.append([skf_accuracy_score, skf_precision_score, skf_recall_score, skf_f1_score])

# Build the results table once, after the loop, with one row per split
# (each row of logistic_reg_class_score is [accuracy, precision, recall, f1])
classification_results = pd.DataFrame(logistic_reg_class_score,
                                      columns=["Accuracy", "Precision", "Recall", "F1 Score"])
classification_results.insert(0, "Algorithm", "Logistic Reg Train")
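As an alternative to the manual loop above, the per-split scores of every candidate are also stored in cv_results_ under keys of the form split{i}_test_{metric}, so they can be read out for the best candidate directly. The sketch below is a minimal illustration on synthetic data with a shortened, hypothetical C grid; the variable names are not taken from the original code.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data and a shortened grid, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

scoring = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, zero_division=0),
    "recall": "recall",
    "f1": "f1",
}
n_splits = 5
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", random_state=42),
    param_grid={"C": [0.1, 1, 5]},
    scoring=scoring,
    cv=StratifiedKFold(n_splits=n_splits),
    refit="accuracy",
)
search.fit(X, y)

# Row index of the candidate chosen by refit="accuracy".
best = search.best_index_

# One row per split, one column per metric, for the best candidate only.
per_split = pd.DataFrame({
    name: [search.cv_results_[f"split{i}_test_{name}"][best] for i in range(n_splits)]
    for name in scoring
})
print(search.best_params_)
print(per_split)
```

Reading the same columns for the other rows of cv_results_ shows which candidates, if any, produce a precision of 0 in a given split.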

Upvotes: 1
