Reputation: 11
I have this Credit Default dataset with head like this:
default student      balance        income  default_Yes
     No      No   729.526495  44361.625074            0
     No     Yes   817.180407  12106.134700            0
     No      No  1073.549164  31767.138947            0
     No      No   529.250605  35704.493935            0
     No      No   785.655883  38463.495879            0
I am trying to perform logistic regression on 'default_Yes' based on the 'balance' attribute, using the following code:
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt

X = cred_def[['balance']]
Y = cred_def['default_Yes']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=76)

logist = LogisticRegression()
logist.fit(X_train, Y_train)
y_pred = logist.predict(X_test)

def model(threshold):
    def_thresh = np.greater(y_pred, threshold).astype(int)
    acc_score = metrics.accuracy_score(Y_test, def_thresh)
    print(acc_score)
    plt.scatter(X_test.values, Y_test.values)
    plt.scatter(X_test.values, def_thresh)
    conf = metrics.confusion_matrix(Y_test, y_pred)
    print(conf)
The problem I am facing: no matter what threshold value I pass to the function 'model', it produces the same output and does not take the passed value into account.
Upvotes: 0
Views: 504
Reputation: 33940
EDIT (in response to the first two edits of this question): you don't pass any parameters whatsoever to logist = LogisticRegression(). You pass random_state=True to train_test_split(), not to LogisticRegression.
random_state is supposed to be an integer (a random seed), not a boolean - read the doc. So by passing True, which gets coerced to 1, you just keep setting random_state=1. Try it with some other integer values and you'll get different results.
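A minimal sketch of that coercion, using a small synthetic array since the original cred_def data isn't available here: because bool is a subclass of int in Python, random_state=True seeds the RNG with 1, so the split is identical to random_state=1, while a genuinely different integer seed generally changes the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# True is coerced to the integer seed 1, so these two calls produce the same split
a = train_test_split(X, y, test_size=0.3, random_state=True)
b = train_test_split(X, y, test_size=0.3, random_state=1)
# A different integer seed typically shuffles the rows differently
c = train_test_split(X, y, test_size=0.3, random_state=2)

print(np.array_equal(a[0], b[0]))  # True: identical train sets
print(np.array_equal(a[0], c[0]))  # usually False: a different seed, a different split
```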
EDIT2: Your issue has nothing to do with the random_state parameter, as originally titled. It has to do with your predicted values, y_pred = logist.predict(X_test), and specifically with how they behave as you sweep your threshold parameter across [0, 1], the possible range of LR output values. Show us a table with at least five different values of threshold, e.g. [0, 0.25, 0.5, 0.75, 1.0], together with whatever you mean by "the result". And what exactly do you mean by "the result": your accuracy acc_score, your confusion matrix conf, or something else? For now, forget the confusion matrix; just look at the effect of applying different threshold values to the same array of predicted values y_pred. Also, sanity-check y_pred by inspecting it. Is it all ones? All zeros? What are its mean, median, etc.? Please post a table of data; do not just keep saying "it doesn't work".
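The point above can be shown with a minimal sketch on synthetic data (standing in for cred_def, which isn't reproduced here): predict() returns hard 0/1 class labels, so any threshold strictly between 0 and 1 leaves them unchanged - that is why model() ignores the argument. Thresholding the probabilities from predict_proba() instead actually responds to the threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic one-feature stand-in for balance -> default_Yes
X, y = make_classification(n_samples=500, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=76)
logist = LogisticRegression().fit(X_train, y_train)

# predict() gives hard labels: only 0s and 1s, nothing in between,
# so (y_pred > 0.25) and (y_pred > 0.75) produce identical arrays
y_pred = logist.predict(X_test)
print(np.unique(y_pred))

# predict_proba() gives P(class=1); thresholding this actually varies the labels
y_prob = logist.predict_proba(X_test)[:, 1]
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    labels = (y_prob > t).astype(int)
    print(t, metrics.accuracy_score(y_test, labels))
```

Sweeping the threshold over y_prob produces the table of accuracies the answer asks for; sweeping it over y_pred cannot.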
Upvotes: 1