Reputation: 283
My logistic regression model is giving me a training accuracy of 80% and a testing accuracy of 79%.
Training Model Accuracy: 0.8039535210772422
Testing Model Accuracy: 0.7937496044721021
My confusion matrix gives me these values:
Using hyperparameter tuning and printing my classification report:
              precision    recall  f1-score   support

           0       0.87      0.88      0.87    172299
           1       0.77      0.70      0.74     17321

   micro avg       0.85      0.85      0.85    189620
   macro avg       0.77      0.74      0.76    189620
weighted avg       0.85      0.85      0.85    189620
When I compare the results to the actual data, only 40% of the predictions match. (I tested the prediction model on 40% of the data.) How could I improve my actual output?
This is my code; any suggestions would be really helpful.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
log_param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Setup the GridSearchCV object: logReg_cv
# (liblinear supports both the l1 and l2 penalties in the grid;
# the default lbfgs solver would fail on l1)
logReg = LogisticRegression(solver='liblinear')
logReg_cv = GridSearchCV(logReg, log_param_grid, cv=5)

y = predict_pi.P_I

# One-hot encode the categorical columns in both frames
X = pd.get_dummies(X)
test = pd.get_dummies(test)

# Align the two frames: add any dummy columns that exist in one
# frame but not the other, filled with zeros
extra_cols_train = [i for i in list(test) if i not in list(X)]
extra_cols_test = [i for i in list(X) if i not in list(test)]
X = X.reindex(columns=X.columns.tolist() + extra_cols_train)
X[extra_cols_train] = 0
test = test.reindex(columns=test.columns.tolist() + extra_cols_test)
test[extra_cols_test] = 0

# Make sure the columns appear in the same order in both frames,
# since the estimator sees only positional arrays
test = test[X.columns]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

logReg_cv.fit(X_train, y_train)
pred_pi = logReg_cv.predict(X_test)
test_pi = logReg_cv.predict(X_train)

print("Training Model Accuracy: {}".format(accuracy_score(y_train, test_pi)))
print("Testing Model Accuracy: {}".format(accuracy_score(y_test, pred_pi)))
print(confusion_matrix(y_test, pred_pi))
print(classification_report(y_test, pred_pi))
print("Tuned Logistic Regression Parameter: {}".format(logReg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logReg_cv.best_score_))
Upvotes: 0
Views: 92
Reputation: 41
This may mean that your model is overfitting your training data. Have you done any EDA on your actual data to see whether its behavior is what you expected, and whether your training/testing data actually represents your actual data?
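For example, a quick sketch of that check using the question's variables (actual_y is a hypothetical Series holding the true labels of your actual data; substitute whatever you have):

# Compare class balance between the data the model saw and the actual data.
# actual_y is hypothetical -- use whatever holds your real labels.
print(y_train.value_counts(normalize=True))   # class balance, training split
print(actual_y.value_counts(normalize=True))  # class balance, actual data
print(X_train.describe())                     # feature summaries, training split
print(test.describe())                        # feature summaries, actual data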
Is your training set a subset of your actual data? I would recommend training your model on your actual data in its entirety; use every bit of data you have for training.
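With the question's variables, that would look something like this (a sketch, not a guaranteed fix):

# Refit the tuned model on all of the labelled data,
# then predict the unlabelled actual data
logReg_cv.fit(X, y)
final_pred = logReg_cv.predict(test)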
When you test your model, I would recommend using cross-validation. If you do 5 or 10 folds of training/testing on your data, you should end up with a decent model.
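A minimal sketch with scikit-learn's cross_val_score, assuming the X and y from the question:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy over the full data (use cv=10 for 10 folds)
scores = cross_val_score(LogisticRegression(solver='liblinear'), X, y,
                         cv=5, scoring='accuracy')
print(scores.mean(), scores.std())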
Upvotes: 0