ai.jennetta

Reputation: 1184

How to fix a very high false negative rate in fairly balanced binary classification?

I have a project that asks me to do binary classification for whether an employee will leave the company or not, based on about 52 features and 2000 rows of data. The data is fairly balanced, with 1200 negative to 800 positive examples. I have done extensive EDA and data cleansing. I chose to try several different models from sklearn: Logistic Regression, SVM, and Random Forest. I am getting very poor and similar results from all of them. I only used 15 of the 52 features for this run, but the results are almost identical to when I used all 52 features. Of the 52 features, 6 were categorical, which I converted to dummies (between 3 and 6 categories per feature), and 3 were datetime, which I converted to days since epoch. There were no null values to fill.

This is the code and confusion matrix from my most recent run with a random forest.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

x_train, x_test, y_train, y_test = train_test_split(small_features, endreason,
                                                    test_size=0.2, random_state=0)


RF = RandomForestClassifier(bootstrap=True,
                            max_features='sqrt',
                            random_state=0)
RF.fit(x_train, y_train)
rf_predictions = RF.predict(x_test)  # predictions used for the confusion matrix below


cm = confusion_matrix(y_test, rf_predictions)
plot_confusion_matrix(cm, classes=['Negative', 'Positive'],  # custom plotting helper
                      title='Confusion Matrix')

[confusion matrix image]

What steps can I take to better fit this model?

Upvotes: 1

Views: 1790

Answers (1)

Celius Stingher

Reputation: 18377

The results you are showing definitely seem a bit discouraging for the methods you propose and the class balance you describe. However, from the description of the problem there definitely seems to be a lot of room for improvement.

When you are using train_test_split, make sure you pass stratify=endreason so the class proportions are preserved in both the training and test splits.
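For example, a minimal sketch of the stratified split, reusing the variable names from your question:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    small_features, endreason,
    test_size=0.2,
    random_state=0,
    stratify=endreason  # keep the 1200/800 class ratio in both splits
)

Moving on to helpful points to improve your model: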

First of all, dimensionality reduction: since you are dealing with many features, some of them might be useless or might even contaminate the classification problem you are trying to solve. It is very important to consider fitting different dimensionality reduction techniques to your data and feeding the transformed data to your model. Some common approaches that might be worth trying (see the sketch after this list):

  • PCA (Principal component analysis)
  • Low Variance & Correlation filter
  • Random Forests feature importance
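A rough sketch of all three, assuming x_train and x_test come from the split above and are pandas DataFrames:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier

# 1) PCA: keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)

# 2) Low-variance filter: drop near-constant features
selector = VarianceThreshold(threshold=0.01)
x_train_lv = selector.fit_transform(x_train)
x_test_lv = selector.transform(x_test)

# 3) Random forest feature importance: keep the most informative columns
rf = RandomForestClassifier(random_state=0).fit(x_train, y_train)
top_idx = np.argsort(rf.feature_importances_)[::-1][:15]  # e.g. top 15 features
x_train_top = x_train.iloc[:, top_idx]
x_test_top = x_test.iloc[:, top_idx]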

Secondly, understanding the model: while Logistic Regression might prove to be an excellent baseline for a linear classifier, it might not necessarily be what you need for this task. Random Forests tend to be much better at capturing non-linear relationships, but need to be controlled and pruned to avoid overfitting, and might require a lot of data. On the other hand, SVM is a very powerful method with non-linear kernels, but might prove inefficient when working with huge amounts of data. XGBoost and LightGBM are very powerful gradient boosting algorithms that have won multiple Kaggle competitions and work very well in almost every case; of course some preprocessing is needed, as XGBoost is not prepared to work with categorical features (LightGBM is). My suggestion is to try these last two methods. From best to worst (in general-case scenarios) I would rank them as follows (see the sketch after the ranking):

  • LightGBM / XGBoost
  • RandomForest / SVM / Logistic Regression
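A minimal sketch of the two boosting baselines, assuming the xgboost and lightgbm packages are installed; both follow the scikit-learn estimator API, so they slot into your existing code:

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

xgb = XGBClassifier(n_estimators=300, learning_rate=0.05, random_state=0)
xgb.fit(x_train, y_train)
print('XGBoost test accuracy:', xgb.score(x_test, y_test))

lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.05, random_state=0)
lgbm.fit(x_train, y_train)
print('LightGBM test accuracy:', lgbm.score(x_test, y_test))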

Last but not least, hyperparameter tuning: regardless of the method you choose, there will always be some fine-tuning to do. Scikit-learn offers GridSearchCV, which comes in really handy, but you need to understand how your classifiers behave in order to know what you should be looking for (a sketch is below). I will not go in-depth on this as it would be off-topic and not suited for SO, but you can definitely have a read here.
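As an illustration, here is a grid search over the random forest; the parameter ranges are placeholders rather than tuned recommendations, and scoring='recall' targets the false negatives you are seeing:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='recall',  # optimize for catching the positive class
    cv=5,
)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)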

Upvotes: 1
