Shihab Ullah

Reputation: 73

Handling false positives and improving classifier performance when training on a medium-sized imbalanced dataset with two features

I have the following imbalanced dataset with two features (keon, i.e. gender, and alder, i.e. age). I balanced it with under-sampling and trained different classifiers on it to predict call_ending_reason, where 0 is No and 1 is Yes:

[image: the dataset]

In the balanced dataset, classes 1 and 0 have the same kind of distribution, which can be visualized like this: [image: distribution of the two classes]

However, after under-sampling the dataset shown above and training various scikit-learn classifiers on both versions of the data, the models trained on the balanced dataset predict 1s with high precision but 0s with very low precision. The opposite happens when I use the original dataset.

Here is the code:

# Separate the features from the target without mutating the original DataFrame
x = filtered_data_limited_features_with_yes_no.drop(columns=['call_ending_reason'])
y = filtered_data_limited_features_with_yes_no['call_ending_reason']

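# GridSearchCV is used below, so it has to be imported before the first grid search
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import GaussianNB
# Hold out 20% of the original (imbalanced) data for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size = 0.80)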

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
# rfc = MLPClassifier(verbose=True,hidden_layer_sizes=(100,50,10),learning_rate='constant',learning_rate_init=0.0001, n_iter_no_change=50, max_iter=100)
# rfc = GaussianNB()
rfc=RandomForestClassifier()
param_grid = { 
    'n_estimators': [50,100,200,500],
    'max_features': ['sqrt', 'log2'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'criterion' :['gini', 'entropy']
}
CV_rfc_all_data = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10)
# rfc = LinearSVC()
CV_rfc_all_data.fit(X_train, y_train)

from sklearn.metrics import classification_report
print(classification_report(y_test, CV_rfc_all_data.predict(X_test)))

from imblearn.under_sampling import RandomUnderSampler

ros = RandomUnderSampler( random_state=1)
df_balanced, balanced_labels = ros.fit_resample(x, y)

####TRAINING AND PREDICTING CLASSIFIER BASED ON BALANCED DATASET
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(df_balanced, balanced_labels, train_size = 0.70)

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier


# rfc=RandomForestClassifier()
# param_grid = { 
#     'n_estimators': [50,100,200,500],
#     'max_features': ['auto', 'sqrt', 'log2'],
#     'criterion' :['gini', 'entropy']
# }
# CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10)
# CV_rfc = MLPClassifier(verbose=True,hidden_layer_sizes=(100,50,10),learning_rate='invscaling',learning_rate_init=0.0003, n_iter_no_change=50, max_iter=100)
CV_rfc = DecisionTreeClassifier()
CV_rfc.fit(X_train, y_train)
# CV_rfc.best_params_

Questions:

Given the visualization:

  1. Which classifier should I use to get more than 65% precision for both class 1 and class 0?
  2. Do I need to scale the data, given that it has only two features? If so, how do I properly scale both the training and the test data?

Upvotes: 1

Views: 91

Answers (1)

Benjamin Breton

Reputation: 1547

You can try setting the class_weight="balanced" argument on the model. Many scikit-learn classifiers support it, including RandomForestClassifier, DecisionTreeClassifier and LinearSVC (MLPClassifier and GaussianNB do not). It won't be magic, but in my experience it usually works better than under- or over-sampling.
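
As a rough sketch of the class_weight suggestion: the real keon/alder data is not available here, so the example below assumes a synthetic two-feature dataset from make_classification. The 90/10 class split, the sample size and the random seeds are assumptions for illustration, not values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): two features,
# roughly 90% of rows in class 0 and 10% in class 1.
X, y = make_classification(
    n_samples=2000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)

# Stratify so the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency,
# so no rows are discarded as they would be with under-sampling.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
```

Because nothing is resampled, the report is computed on a test set with the original class ratio, which is closer to what the model would see in production.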

For the metric used in your grid search, I would use f1_score, as suggested by @Erwan. It heavily penalizes both poor precision and poor recall, and it rewards hyperparameters that yield a more balanced model.
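
A minimal sketch of how the F1 metric could be plugged into the grid search, again on assumed synthetic data rather than the real dataset. The smaller parameter grid, cv=5 and the seeds are chosen only to keep the example quick. Note that scoring="f1" scores the positive class (label 1) only; "f1_macro" is an alternative if both classes should count equally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic two-feature, imbalanced data (assumption, stands in for the real set).
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

# Reduced version of the grid from the question, to keep the run short.
param_grid = {
    "n_estimators": [25, 50],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid=param_grid,
    scoring="f1",  # select hyperparameters by F1 instead of the default accuracy
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated F1 of the best combination
```

Without the scoring argument, GridSearchCV falls back to the estimator's default score, which is accuracy for classifiers and tends to favour the majority class on imbalanced data.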

Upvotes: 0
