Rashmi Singh

Reputation: 11

Logistic Regression: how to compare predicted value with a threshold and get the classification done

I have this Credit Default dataset with head like this:

default  student  balance      income        default_Yes
No       No       729.526495   44361.625074  0
No       Yes      817.180407   12106.134700  0
No       No       1073.549164  31767.138947  0
No       No       529.250605   35704.493935  0
No       No       785.655883   38463.495879  0

I am trying to fit a logistic regression for 'default_Yes' using only the 'balance' attribute, with the following code:

 import numpy as np
 import matplotlib.pyplot as plt
 from sklearn.linear_model import LogisticRegression
 # train_test_split lives in model_selection (sklearn.cross_validation was removed in 0.20)
 from sklearn.model_selection import train_test_split
 from sklearn import metrics

 X = cred_def[['balance']]
 Y = cred_def['default_Yes']
 X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=76)
 logist = LogisticRegression()
 logist.fit(X_train, Y_train)
 y_pred = logist.predict(X_test)


 def model(threshold):
     # Compare the predictions against the threshold and score the result
     def_thresh = np.greater(y_pred, threshold).astype(int)
     acc_score = metrics.accuracy_score(Y_test, def_thresh)
     print(acc_score)
     plt.scatter(X_test.values, Y_test.values)
     plt.scatter(X_test.values, def_thresh)
     conf = metrics.confusion_matrix(Y_test, y_pred)
     print(conf)

The problem I am facing: no matter what threshold value I pass to the function 'model', it produces the same output, as if the value passed were ignored.

Upvotes: 0

Views: 504

Answers (1)

smci

Reputation: 33940

EDIT (in response to the first two edits of this question statement): you don't pass any parameters whatsoever to logist = LogisticRegression(). You pass random_state=True to train_test_split(), not to LogisticRegression.

random_state is supposed to be an integer (a random seed), not a boolean; read the doc. By passing True, which gets coerced to 1, you just keep setting random_state = 1.

Try it on some other integer values and you'll get different results.
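To make the point about seeds concrete, here is a minimal sketch on placeholder data (the array `data` is purely illustrative and not the asker's dataset). It only shows that the same integer seed reproduces a split while a different seed changes it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(100)  # placeholder data, purely illustrative

# Same integer seed: the split is identical on every call.
_, test_a = train_test_split(data, test_size=0.3, random_state=1)
_, test_b = train_test_split(data, test_size=0.3, random_state=1)

# Different seed: a different shuffle, hence a different test set.
_, test_c = train_test_split(data, test_size=0.3, random_state=2)

print(np.array_equal(test_a, test_b))  # same seed compared with itself
print(np.array_equal(test_a, test_c))  # seed 1 compared with seed 2
```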

EDIT2: Your issue has nothing to do with the random_state parameter, as the question was originally titled. It has to do with your predicted values y_pred = logist.predict(X_test), and specifically how they behave as you sweep your threshold parameter across the possible range [0,1] of LR output values.

Show us a table with at least five different values of threshold, like [0, 0.25, 0.5, 0.75, 1.0], together with the output you get for each. And what exactly do you mean by "the result"? Your accuracy acc_score, your confusion matrix conf, or something else? For now, forget the confusion matrix. Just look at the effect of applying different values of threshold to the same array of predicted values y_pred.

Also, sanity-check y_pred by inspecting it. Is it all ones? All zeros? What are its mean, median, etc.? Please post a table of data. Do not just keep saying "it doesn't work".
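A minimal sketch of the sanity check and threshold sweep described above, run on synthetic data because the real dataset is not available here. The generated `X` and `y`, the helper name `sweep`, and the chosen thresholds are all assumptions made for illustration. One thing worth checking along the way: in scikit-learn, `predict` returns hard class labels (0 or 1), whereas `predict_proba` returns estimated probabilities, so a threshold strictly between 0 and 1 can only change the outcome when applied to the latter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (an assumption, not the real dataset):
# one standardized 'balance'-like feature, with class 1 more likely at high values.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))
p_true = 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.0)))
y = (rng.random(2000) < p_true).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=76
)
clf = LogisticRegression().fit(X_train, y_train)

hard_labels = clf.predict(X_test)        # class labels, only 0 or 1
proba = clf.predict_proba(X_test)[:, 1]  # estimated P(class 1) per row

# Sanity check: which distinct values does each array contain?
print("predict values:", np.unique(hard_labels))
print("predict_proba min/mean/max:", proba.min(), proba.mean(), proba.max())


def sweep(scores, thresholds):
    """Return (threshold, accuracy, confusion matrix) for each threshold."""
    rows = []
    for t in thresholds:
        labels = (scores > t).astype(int)
        acc = accuracy_score(y_test, labels)
        cm = confusion_matrix(y_test, labels, labels=[0, 1])
        rows.append((t, acc, cm))
    return rows


thresholds = [0.1, 0.25, 0.5, 0.75, 0.9]
for t, acc, cm in sweep(proba, thresholds):
    print(f"threshold={t:.2f} accuracy={acc:.3f} confusion={cm.tolist()}")
```

Passing `hard_labels` to `sweep` instead of `proba` gives the same row for every threshold in the list, which matches the symptom described in the question. Note also that the confusion matrix here is computed from the thresholded labels rather than from the raw predictions.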

Upvotes: 1
