Reputation: 61
I built a logistic regression model to predict loan acceptors. The dataset is 94% non-acceptors and 6% acceptors. I've run several logistic regression models: one on the original dataset, one after upsampling to 50/50 and removing some predictor variables, and one without upsampling but after removing some predictor variables.
Model 1: Better than 90% accuracy, precision, and recall on 25 feature columns. After running the model, I output the predictions to a different CSV (same people as in the original CSV, though), and it returns 10,000 acceptors. My guess was that this could be caused by overfitting. I wasn't sure, so I then tried the same 94% non-acceptor / 6% acceptor data with fewer variables (19 feature columns). This time the accuracy is 81%, but the precision is only 21%, while recall is 76% (for training and test). This time it returns only 8 acceptors in total (out of 18,000).
Finally, I upsampled to a balanced set. The accuracy is only 68% (which I can work with), and precision and recall are both 66% for training and test. I ran the model and output the predictions to the CSV file (again the same people in a different CSV file; I'm not sure whether that's messing it up), and this time it returned 0 acceptors.
Does anyone have advice on what is causing this and how I can fix it?
I'm not sure which regression code would be most useful to post. I'm happy to post the upsampling code as well if that would be more helpful.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_curve, auc, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# df is assumed to be loaded already (not shown in the post)
y = df.OpenedLCInd.values
X = df.drop('OpenedLCInd', axis=1)
cols = X.columns

# Scale every feature to the [0, 1] range
minmax = MinMaxScaler()
X = pd.DataFrame(minmax.fit_transform(X))
X.columns = cols

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=33)

# Very large C means almost no regularization; class weights are balanced
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear', class_weight='balanced')
logreg.fit(X_train, y_train)
y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)

# Residuals on the training set: 0 = correct, 1 = misclassified
residuals = np.abs(y_train - y_hat_train)

# statsmodels fit on the same training data, used for the coefficient summary
logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()
print(result.summary())
print(pd.Series(residuals).value_counts())
print(pd.Series(residuals).value_counts(normalize=True))

## Output predictions to new dataset
test = pd.read_csv(r'link')
predictions = logreg.predict(X_test)
# Features from the new CSV are passed to the model as read from the file
test_predictions = logreg.predict(test.drop('OpenedLCInd', axis=1))
test["predictions"] = test_predictions
test.to_csv(r'output link')
Upvotes: 0
Views: 1221
Reputation: 1873
You never evaluate on a validation set (the test set in the code above). To fix this, compute the residuals on the test split,
residuals = np.abs(y_test - y_hat_test)
instead of on y_train.
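A minimal sketch of evaluating on the held-out split is shown here. It uses synthetic data from `make_classification` with a roughly 94/6 class ratio as a stand-in for the loan data; the dataset, its size, and the model settings are assumptions for illustration only, not the setup from the question.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (assumption): ~94% negatives, ~6% positives.
X, y = make_classification(n_samples=5000, n_features=19, n_informative=6,
                           weights=[0.94, 0.06], random_state=33)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=33)

logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
logreg.fit(X_train, y_train)

# Residuals on the held-out data: 0 = correct, 1 = misclassified.
y_hat_test = logreg.predict(X_test)
residuals = np.abs(y_test - y_hat_test)
print(pd.Series(residuals).value_counts(normalize=True))

# Accuracy alone is misleading at a 94/6 split, so also report
# the confusion matrix, precision, and recall on the test split.
test_cm = confusion_matrix(y_test, y_hat_test)
test_precision = precision_score(y_test, y_hat_test)
test_recall = recall_score(y_test, y_hat_test)
print(test_cm)
print(f"precision={test_precision:.2f} recall={test_recall:.2f}")
```

Comparing these numbers with the same metrics on the training split gives a first indication of whether the model is overfitting.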
Also, it is useful to apply cross-validation to check that the model performs consistently across different splits of the data.
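One way to set up such a check is sketched below with scikit-learn's `cross_validate` and `StratifiedKFold`. The data is again synthetic (an assumption, generated with `make_classification` at a roughly 94/6 class ratio), and the five folds and the pipeline layout are illustrative choices rather than anything taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the loan data (assumption): ~94% negatives, ~6% positives.
X, y = make_classification(n_samples=5000, n_features=19, n_informative=6,
                           weights=[0.94, 0.06], random_state=33)

# Scaler and classifier in one pipeline, so the scaler is refit on the
# training part of every fold.
model = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)
cv_scores = cross_validate(model, X, y, cv=cv,
                           scoring=["precision", "recall", "accuracy"])

# Report mean and spread per metric; a large spread across folds
# suggests the model is not consistently good.
for name in ("precision", "recall", "accuracy"):
    fold_scores = cv_scores[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Putting the scaler inside the pipeline means each fold's validation data is transformed with statistics learned only from that fold's training data, which avoids leaking information between the two.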
Upvotes: 2