Reputation: 1991
Goal: Determine if rfq_num_of_dealers is a significant predictor of a Done trade (Done = 1).
My Data:
df_Train_Test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 139025 entries, 0 to 139024
Data columns (total 2 columns):
rfq_num_of_dealers 139025 non-null float64
Done 139025 non-null uint8
dtypes: float64(1), uint8(1)
df_Train_Test = df_Train_Test[['rfq_num_of_dealers','Done']]
df_Train_Test_GrpBy = df_Train_Test.groupby(['rfq_num_of_dealers','Done']).size().reset_index(name='Count').sort_values(['rfq_num_of_dealers','Done'])
display(df_Train_Test_GrpBy)
Column rfq_num_of_dealers ranges from 0 to 21, and column Done is either 0 or 1. Note that every record has a populated Done value of 0 or 1.
Logistic regression:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

x = df_Train_Test[['rfq_num_of_dealers']]
y = df_Train_Test['Done']

# 1. Split into training and testing sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

# 2. Train and fit a logistic regression model on the training set.
logmodel = LogisticRegression()  # create an instance of the model
logmodel.fit(x_train, y_train)   # fit the model on the training data

# 3. Now predict values for the testing data.
predictions = logmodel.predict(x_test)  # predict on the test data (note the model was fit on the training data)

# 4. Create a classification report for the model.
print(classification_report(y_test, predictions))

# 5. Create a confusion matrix for the model.
print(confusion_matrix(y_test, predictions))  # the diagonal holds the correct predictions
This yields the following warning:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
The report and matrix are clearly wrong; note the right-hand column of the confusion matrix:
precision recall f1-score support
0 0.92 1.00 0.96 41981
1 0.00 0.00 0.00 3898
avg / total 0.84 0.92 0.87 45879
[[41981 0]
[ 3898 0]]
How can this warning be raised if 'Done' (the y label) has either a 1 or 0 and all values are populated? Is there any code I can run to determine exactly which y labels cause the warning? Other outputs:
display(pd.Series(predictions).value_counts())
0 45879
dtype: int64
display(pd.Series(predictions).describe())
count 45879.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
dtype: float64
display(y_test)
71738 0
39861 0
16567 0
81750 1
88513 0
16314 0
113822 0
...
display(y_test.describe())
count 45879.000000
mean 0.084963
std 0.278829
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: Done, dtype: float64
display(y_test.value_counts())
0 41981
1 3898
Name: Done, dtype: int64
Could this have something to do with the fact that there are 12439 records with both rfq_num_of_dealers and Done equal to zero?
Upvotes: 1
Views: 1892
Reputation: 76297
Precision is a ratio:
precision = tp / (tp + fp)
The warning is telling you that the ratio is undefined, almost surely because the denominator is 0. That is, there are no true positives and no false positives on the test set. What those two quantities have in common is that they are both predicted positives.
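As a minimal sketch (assuming the y_test and predictions from your code above), you can compute that denominator yourself from the confusion matrix:

from sklearn.metrics import confusion_matrix

# Rows of the 2x2 matrix are the true labels, columns are the predictions.
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
predicted_positives = tp + fp  # the denominator of precision for class 1
print(predicted_positives)     # 0 in your output, so tp / (tp + fp) is 0/0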
It is very probable that your classifier is not predicting positives at all on the test data; your value_counts output above confirms it - all 45879 test predictions are 0.
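A quick check (a sketch, assuming your fitted logmodel and x_test are in scope) is to look at the predicted probabilities for class 1 - predict returns 1 only where that probability exceeds 0.5:

# Probability of class 1 for each test row.
probs = logmodel.predict_proba(x_test)[:, 1]
print(probs.max())  # if this never reaches 0.5, every prediction is 0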
Before dividing into train and test, you might want to randomize the order of your instances (or use a stratified split) - it's possible that there's something systematic about the original order. This may or may not solve the problem, but, again, it looks like the root cause is the lack of predicted positives in the test dataset.
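For example, a stratified split (a sketch using the same x and y as your code; train_test_split already shuffles by default, and stratify=y keeps the 0/1 proportions equal across both splits):

from sklearn.model_selection import train_test_split

# Preserve the ~8.5% positive rate in both the train and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42, stratify=y)

If the split is not the issue, the imbalance itself may be; it may be worth trying LogisticRegression(class_weight='balanced'), which reweights the rare positive class.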
Upvotes: 2