Handling false positives in classifiers and improving performance when training on a medium-sized unbalanced dataset with two features

Question

I have the following unbalanced dataset with two features (keon, i.e. gender, and alder, i.e. age). I balanced it with random under-sampling and trained several classifiers on it to predict call_ending_reason, where 0 is No and 1 is Yes:

In the balanced dataset, classes 1 and 0 have a similar distribution, which can be visualized like this:

However, after under-sampling the dataset shown above and training various sklearn classifiers on both versions of the data, the models trained on the balanced dataset predict 1s with high precision but 0s with very low precision. The opposite happens when I use the original (unbalanced) dataset.
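One possible contributor to this flip (an assumption, not something confirmed for this data) is that precision depends on the class proportions of the set it is measured on. The sketch below uses hypothetical true-positive and false-positive rates and an assumed 10% share of the positive class to show how the same classifier quality gives different precision on a balanced versus an unbalanced test set.

```python
def precision(tpr, fpr, prevalence):
    """Precision for the positive class, given its true-positive rate,
    false-positive rate and its share of the evaluated samples."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier quality, kept identical in both scenarios.
TPR, FPR = 0.70, 0.30

balanced = precision(TPR, FPR, prevalence=0.50)    # under-sampled test set
imbalanced = precision(TPR, FPR, prevalence=0.10)  # assumed 10% positives

print(f"precision on balanced test set:   {balanced:.2f}")
print(f"precision on unbalanced test set: {imbalanced:.2f}")
```

If this effect is at play, the two classification reports are not directly comparable, because each is computed on a test set with a different class mix.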

Here is the code:

# Separate the target from the features without mutating the original DataFrame
y = filtered_data_limited_features_with_yes_no['call_ending_reason']
x = filtered_data_limited_features_with_yes_no.drop(columns=['call_ending_reason'])

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size = 0.80)

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
# rfc = MLPClassifier(verbose=True,hidden_layer_sizes=(100,50,10),learning_rate='constant',learning_rate_init=0.0001, n_iter_no_change=50, max_iter=100)
# rfc = GaussianNB()
rfc=RandomForestClassifier()
param_grid = { 
    'n_estimators': [50,100,200,500],
    'max_features': ['sqrt', 'log2'],  # 'auto' was an alias for 'sqrt' here and is deprecated since scikit-learn 1.1
    'criterion' :['gini', 'entropy']
}
CV_rfc_all_data = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10)
# rfc = LinearSVC()
CV_rfc_all_data.fit(X_train, y_train)

from sklearn.metrics import classification_report
print(classification_report(y_test, CV_rfc_all_data.predict(X_test)))

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes have the same count
rus = RandomUnderSampler(random_state=1)
df_balanced, balanced_labels = rus.fit_resample(x, y)

####TRAINING AND PREDICTING CLASSIFIER BASED ON BALANCED DATASET
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(df_balanced, balanced_labels, train_size = 0.70)

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier


# rfc=RandomForestClassifier()
# param_grid = { 
#     'n_estimators': [50,100,200,500],
#     'max_features': ['auto', 'sqrt', 'log2'],
#     'criterion' :['gini', 'entropy']
# }
# CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10)
# CV_rfc = MLPClassifier(verbose=True,hidden_layer_sizes=(100,50,10),learning_rate='invscaling',learning_rate_init=0.0003, n_iter_no_change=50, max_iter=100)
CV_rfc = DecisionTreeClassifier()
CV_rfc.fit(X_train, y_train)
# CV_rfc.best_params_

Questions:

Given the visualization:

Which classifier should I use to get more than 65% precision for both classes (1 and 0)?
Do I need to scale the data, given that it has only 2 features? If so, how do I properly scale both the training and the testing data?
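For the scaling question, one common pattern is to split first and put the scaler inside a scikit-learn pipeline, so it is fitted on the training rows only and then reused unchanged on the test rows. The sketch below is a minimal illustration on synthetic data: the arrays, the roughly 20% positive rate, and the choice of LogisticRegression with class_weight="balanced" are all assumptions for demonstration, not the setup from the question. Tree-based models such as RandomForestClassifier and DecisionTreeClassifier are generally insensitive to feature scaling, while MLPClassifier and LinearSVC usually benefit from it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): column 0 plays the
# role of a 0/1 gender code, column 1 the role of age in years.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([rng.integers(0, 2, n), rng.integers(18, 90, n)]).astype(float)
y = (rng.random(n) < 0.2).astype(int)  # roughly 20% positives (assumed)

# Split BEFORE any scaling or resampling; stratify keeps the class mix
# of the test set close to that of the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

# The pipeline fits StandardScaler on X_train only, then applies the same
# mean and variance when predicting on X_test.
model = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Retrieve the fitted scaler to confirm it only saw the training rows.
scaler = model.named_steps["standardscaler"]
print(classification_report(y_test, y_pred, zero_division=0))
```

Because the labels here are random, the printed scores carry no meaning; the point is only the order of operations (split, then fit the scaler on the training part).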

Answers (1)
