Reputation: 3548
I'm new to pandas & scikit-learn. I've been able to put together a simple binary classification model with Bad and Good labels:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

df = pd.read_csv('pandas_model.csv', header=None, names=['label', 'resume'])
X = df.resume.astype('U').values
y = df.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

vect = TfidfVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
## create test
X_test_dtm = vect.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
score = metrics.accuracy_score(y_test, y_pred_class)
print('LogReg Accuracy Score: %s' % score)
log_reg_cf = metrics.confusion_matrix(y_test, y_pred_class)
print(log_reg_cf)
Confusion Matrix:
[[2696 165]
[ 742 424]]
It looks like it is guessing too many data points as "Bad" (No) when they should be "Good" (Yes) - 742 of them.
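For reference, a minimal sketch (with made-up labels, not my real data) of how scikit-learn lays out a binary confusion matrix: rows are true labels, columns are predicted labels, both in sorted class order, so with these labels "Bad" comes before "Good":

```python
# Toy example: rows = true labels, columns = predicted labels,
# ordered alphabetically ("Bad", then "Good").
from sklearn.metrics import confusion_matrix

y_true = ["Bad", "Bad", "Good", "Good", "Good"]
y_pred = ["Bad", "Good", "Bad", "Good", "Good"]

cm = confusion_matrix(y_true, y_pred)
# cm[0, 0]: true Bad  predicted Bad,  cm[0, 1]: true Bad  predicted Good
# cm[1, 0]: true Good predicted Bad,  cm[1, 1]: true Good predicted Good
print(cm)
```

So the 742 above sits at cm[1, 0]: true Good rows predicted as Bad.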
I read that scikit-learn uses 0.5 as the threshold to make a decision from the predict_proba() score. I'm trying to put together a way to test various thresholds - e.g. 0.4 instead of 0.5 - which would move some of the guessed data points from False Negative to being correctly guessed as Good.
logreg.predict_proba(X_test_dtm) gives me a 2D array of the scores (Bad / Good):
array([[0.59946085, 0.40053915], ## guessed as bad, but if the threshold was .6, it would be guessed as good. This is what I'm trying to run simulations on
[0.89679281, 0.10320719],
[0.328435 , 0.671565 ],
...,
[0.50415322, 0.49584678],
[0.84380259, 0.15619741],
[0.85216752, 0.14783248]])
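The column order follows logreg.classes_, which is sorted, so for these labels column 0 is "Bad" and column 1 is "Good". A minimal sketch with toy data (not my real CSV) confirming this:

```python
# Toy fit: classes_ is sorted alphabetically, and predict_proba
# columns line up with it, so column 1 is the "Good" probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array(["Bad", "Bad", "Good", "Good"])

clf = LogisticRegression().fit(X, y)
print(clf.classes_)      # class order used for the probability columns
probs = clf.predict_proba(X)
print(probs[:, 1])       # probability of "Good" for each row
```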
y_test.head() gives me the true values (by the way, what does 5369 represent - the row number?)
5369 Bad
11313 Bad
11899 Good
3856 Bad
1961 Bad
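(Those left-hand numbers are the pandas index labels of the original DataFrame rows; train_test_split carries them through after shuffling, which is how you can map a test prediction back to its source row. A small sketch with a toy Series:)

```python
# train_test_split preserves the pandas index, so the shuffled
# test set keeps the original row labels on the left.
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series(["Bad", "Good", "Bad", "Good"], index=[10, 20, 30, 40])
y_train, y_test = train_test_split(y, random_state=1)
print(y_test)  # left-hand numbers are the original index labels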
Ideally I'm trying to run simulations that do something like this across all of the predict_proba() scores:

if bad_score > .6 (instead of .5):
    result = bad
else:
    result = good

and then re-check the predictions against y_test and re-compute the accuracy score.
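A minimal, self-contained sketch of that simulation, using made-up probabilities and labels as stand-ins for predict_proba(X_test_dtm) and y_test (column 0 is the "Bad" score; raising its threshold from 0.5 to 0.6 makes it harder to call a row Bad):

```python
# Apply a custom threshold to the "Bad" column (column 0) instead of
# the default argmax at 0.5, then re-score against the true labels.
import numpy as np
from sklearn import metrics

probs = np.array([[0.59, 0.41],   # stand-in for predict_proba output
                  [0.90, 0.10],
                  [0.33, 0.67],
                  [0.55, 0.45]])
y_test = np.array(["Good", "Bad", "Good", "Bad"])

threshold = 0.6  # instead of the default 0.5
y_pred = np.where(probs[:, 0] > threshold, "Bad", "Good")
print(y_pred)
print(metrics.accuracy_score(y_test, y_pred))  # 0.75
```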
It doesn't seem like there is any way to move that 0.5 threshold inside scikit-learn's predict(), so it looks like I have to do it manually. Basically I'm trying to make it "harder" for a data point to be guessed as Bad. Hopefully I've worded this question so it makes sense.
I am getting an error when following the question marked as a duplicate:
from sklearn.metrics import precision_recall_curve
probs_y=logreg.predict_proba(X_test_dtm)
precision, recall, thresholds = precision_recall_curve(y_test, probs_y[:, 0])
ValueError: Data is not binary and pos_label is not specified
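That error appears because the labels are the strings "Bad"/"Good" rather than 0/1, so precision_recall_curve needs pos_label to know which class counts as positive (and the probability column passed in should be that class's column). A hedged sketch with toy data, not the real CSV:

```python
# Pass the "Good" probability column together with pos_label="Good"
# so precision_recall_curve knows which string label is positive.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array(["Bad", "Bad", "Good", "Good"])
clf = LogisticRegression().fit(X, y)

probs = clf.predict_proba(X)
# classes_ is sorted, so column 1 is "Good"
precision, recall, thresholds = precision_recall_curve(
    y, probs[:, 1], pos_label="Good")
print(precision, recall, thresholds)
```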
Upvotes: 0
Views: 3689
Reputation: 24904
IIUC you could do it really simply (at least for the binary case) with predict_proba:
probabilities = logreg.predict_proba(X_test_dtm)
threshold = 0.4
good = probabilities[:, 1]
predicted_good = good > threshold
This would give you a binary prediction for the good case whenever its probability is higher than the threshold (0.4 here).
You can easily generalize the code above to test any threshold you like, with whatever metric you like that requires a binary prediction.
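For instance, a sketch of that generalization: sweep several thresholds and recompute accuracy for each (toy probabilities and labels stand in for the real predict_proba output and y_test):

```python
# Sweep candidate thresholds over the "Good" probability column and
# report the accuracy each one would produce.
import numpy as np
from sklearn.metrics import accuracy_score

good = np.array([0.41, 0.10, 0.67, 0.45, 0.72])        # toy "Good" scores
y_true = np.array(["Good", "Bad", "Good", "Good", "Bad"])

for threshold in [0.3, 0.4, 0.5, 0.6]:
    y_pred = np.where(good > threshold, "Good", "Bad")
    print(threshold, accuracy_score(y_true, y_pred))
```

Swap accuracy_score for precision, recall, or F1 as needed, since each threshold yields an ordinary binary prediction.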
Upvotes: 4