Peter Lucas

Reputation: 1991

Logistic Regression: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples

Goal: Determine if rfq_num_of_dealers is a significant predictor of a Done trade (Done =1).
My Data:

df_Train_Test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 139025 entries, 0 to 139024
Data columns (total 2 columns):
rfq_num_of_dealers    139025 non-null float64
Done                  139025 non-null uint8
dtypes: float64(1), uint8(1)

df_Train_Test = df_Train_Test[['rfq_num_of_dealers','Done']]
df_Train_Test_GrpBy = df_Train_Test.groupby(['rfq_num_of_dealers','Done']).size().reset_index(name='Count').sort_values(['rfq_num_of_dealers','Done'])
display(df_Train_Test_GrpBy)

Column rfq_num_of_dealers ranges from 0 to 21, and column Done is either 0 or 1. Note that every row has a Done value of 0 or 1; none are missing.

[Image: table of Count per (rfq_num_of_dealers, Done) pair from df_Train_Test_GrpBy]

Logistic regression:

from sklearn.model_selection import train_test_split

x = df_Train_Test[['rfq_num_of_dealers']]
y = df_Train_Test['Done']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

# 2 Train and fit a logistic regression model on the training set. 
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()               # create instance of model
logmodel.fit(x_train,y_train)                 # fit model against the training data

# 3. Now predict values for the testing data.
predictions = logmodel.predict(x_test)        # Predict off the test data (note fit model is off train data)

# 4 Create a classification report for the model.
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

# 5 Create a confusion matrix for the model.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,predictions))    # The diagonals are the correct predictions

This yields the following warning

 UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.       
'precision', 'predicted', average, warn_for)

The report and confusion matrix are clearly wrong; note the right-hand column of the confusion matrix is all zeros:

       precision    recall  f1-score   support

          0       0.92      1.00      0.96     41981
          1       0.00      0.00      0.00      3898

avg / total       0.84      0.92      0.87     45879

[[41981     0]
 [ 3898     0]]

How can this warning be raised if Done is always populated with either a 1 or a 0 (the y label)? Is there any code I can run to determine exactly which y labels cause the warning? Other outputs:

display(pd.Series(predictions).value_counts())
0    45879
dtype: int64

display(pd.Series(predictions).describe())
count    45879.0
mean         0.0
std          0.0
min          0.0
25%          0.0
50%          0.0
75%          0.0
max          0.0
dtype: float64

display(y_test)
71738     0
39861     0
16567     0
81750     1
88513     0
16314     0
113822    0
.         .

display(y_test.describe())
count    45879.000000
mean         0.084963
std          0.278829
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: Done, dtype: float64

display(y_test.value_counts())
0    41981
1     3898
Name: Done, dtype: int64

Could this have something to do with the fact that there are 12439 records where both rfq_num_of_dealers and Done equal zero?

Upvotes: 1

Views: 1892

Answers (1)

Ami Tavory

Reputation: 76297

Precision is a ratio:

precision = tp / (tp + fp)

The warning is telling you that this ratio is undefined, almost surely because the denominator is 0: there are no true positives and no false positives on the test set. What these two have in common is that both are predicted positives.
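You can see this directly from the confusion matrix in the question; a minimal check using the posted numbers:

import numpy as np

# Confusion matrix from the question: rows are true classes, columns are predicted classes.
cm = np.array([[41981, 0],
               [ 3898, 0]])

tp = cm[1, 1]            # true positives for class 1
fp = cm[0, 1]            # false positives for class 1
print(tp + fp)           # 0 -> precision = tp / (tp + fp) is undefined for class 1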

It is very probable that your classifier is not predicting positives at all on the test data.
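To find exactly which labels trigger the warning, you can compare the labels present in y_test against the labels that actually appear in predictions; a quick check using the variables from your code:

# Labels that occur in the test set but are never predicted --
# these are the labels the warning is complaining about.
missing = set(y_test.unique()) - set(predictions)
print(missing)           # {1}: class 1 is never predicted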

Before dividing into train and test, you might want to randomize the order of your instances (or use a stratified split) - it's possible that there's something systematic about the original order. This may or may not solve the problem but, again, the root cause looks like a lack of predicted positives in the test dataset.
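A minimal sketch of that suggestion, plus class reweighting (stratify and class_weight='balanced' are standard scikit-learn options, but adding them here is my suggestion, not something from the original code):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Stratify so train and test keep the same 0/1 proportions, and
# reweight classes so the rare positive class is not ignored.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42, stratify=y)

logmodel = LogisticRegression(class_weight='balanced')
logmodel.fit(x_train, y_train)
predictions = logmodel.predict(x_test)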

Upvotes: 2
