user14316330

Reputation: 61

Logistic Regression Model in Python Has Good Accuracy and Precision, but Predictions Are Way Off

I built a logistic regression model to predict loan acceptors. The dataset is 94% non-acceptors and 6% acceptors. I've run several logistic regression models: one on the original dataset, one after upsampling to 50/50 and removing some predictor variables, and one without the upsampling but with the same predictor variables removed.

Model 1: better than 90% accuracy, precision, and recall on 25 feature columns. After running the model, I output the predictions to a different CSV (same people as the original CSV, though) and it returns 10,000 acceptors. My guess was that this could be caused by overfitting. I wasn't sure, so I then tried it on the same 94% non-acceptor / 6% acceptor data, but with fewer variables (19 feature columns). This time the accuracy is 81%, but the precision is only 21%, while recall is 76% (for both training and test). This time it returns only 8 acceptors total (out of 18,000).

Finally, I tried upsampling to a balanced set. The accuracy is only 68% (which I can work with), and precision and recall are both 66% for training and test. I ran the model, then output the predictions to the CSV file (again, same people, different CSV file; not sure if that's messing it up), and this time it returned 0 acceptors.
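For reference, my upsampling was roughly along these lines (a simplified sketch; train_df here stands in for the training portion of my data, and my real code has a few more steps):

from sklearn.utils import resample

# Split the training rows by class
majority = train_df[train_df.OpenedLCInd == 0]
minority = train_df[train_df.OpenedLCInd == 1]

# Resample the minority class with replacement until it matches the majority
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=33)

# Recombine into a balanced 50/50 training set
train_balanced = pd.concat([majority, minority_upsampled])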

Does anyone have any advice on what is causing this and how to fix it?

I'm not sure which part of the code would be most helpful, so the regression code is below. I'm happy to post the full upsampling code if that would be more useful than the sketch above.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_score, recall_score, accuracy_score,
                             f1_score, roc_curve, auc, confusion_matrix)

# df holds the loan data, with OpenedLCInd as the target column
y = df.OpenedLCInd.values
X = df.drop('OpenedLCInd', axis=1)
cols = X.columns

# Scale all features to the [0, 1] range
minmax = MinMaxScaler()
X = pd.DataFrame(minmax.fit_transform(X))
X.columns = cols

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Very large C effectively disables regularization
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear', class_weight='balanced')
logreg.fit(X_train, y_train)

y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)

# Misclassification indicator on the training set
residuals = np.abs(y_train - y_hat_train)

# Fit the same model with statsmodels to get a coefficient summary
logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()
print(result.summary())

print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))

## Output predictions to new dataset

test = pd.read_csv(r'link')

# Predictions on the held-out test split
predictions = logreg.predict(X_test)

# Predictions for the new CSV, written back alongside the original rows
test_predictions = logreg.predict(test.drop('OpenedLCInd', axis=1))
test["predictions"] = test_predictions

test.to_csv(r'output link')
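For what it's worth, this is roughly how I'm counting acceptors in the exported predictions (the 10,000 / 8 / 0 figures above):

print(test["predictions"].value_counts())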

Upvotes: 0

Views: 1221

Answers (1)

Kate Melnykova

Reputation: 1873

You never evaluate on a validation set (the test set in the code above); the residuals and metrics are computed on the training data. To fix it, compute residuals = np.abs(y_test - y_hat_test) instead of using y_train.
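For example, keeping the names from your code:

y_hat_test = logreg.predict(X_test)

# Evaluate on the held-out test set, not the training set
residuals = np.abs(y_test - y_hat_test)
print(pd.Series(residuals).value_counts(normalize=True))

print(accuracy_score(y_test, y_hat_test))
print(precision_score(y_test, y_hat_test))
print(recall_score(y_test, y_hat_test))
print(confusion_matrix(y_test, y_hat_test))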

Also, it is useful to apply cross-validation to check that the model performs consistently well across different splits of the data.
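For instance, scikit-learn's cross_val_score refits and scores the model on several train/validation splits (five folds here; choose the scoring metric that matters most for your problem):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(logreg, X, y, cv=5, scoring='recall')
print(scores)
print(scores.mean())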

Upvotes: 2
