Reputation: 73
I have the following unbalanced dataset with two features (keon, i.e. gender, and alder, i.e. age). I balanced it with under-sampling and trained several classifiers on both versions to predict call_ending_reason, where 0 is No and 1 is Yes:
In the balanced dataset, classes 1 and 0 have a similar distribution, which can be visualized like this:
However, after under-sampling the dataset shown above and training various sklearn classifiers on both versions, the models trained on the balanced dataset predict 1s with high precision but 0s with very low precision. The opposite happens when I use the original dataset.
Here is the code:
# Separate the target from the features without mutating the source DataFrame
y = filtered_data_limited_features_with_yes_no['call_ending_reason']
x = filtered_data_limited_features_with_yes_no.drop(columns=['call_ending_reason'])
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size = 0.80)
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
# rfc = MLPClassifier(verbose=True,hidden_layer_sizes=(100,50,10),learning_rate='constant',learning_rate_init=0.0001, n_iter_no_change=50, max_iter=100)
# rfc = GaussianNB()
rfc=RandomForestClassifier()
param_grid = {
'n_estimators': [50,100,200,500],
'max_features': ['sqrt', 'log2'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
'criterion' :['gini', 'entropy']
}
CV_rfc_all_data = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10)
# rfc = LinearSVC()
CV_rfc_all_data.fit(X_train, y_train)
from sklearn.metrics import classification_report
print(classification_report(y_test, CV_rfc_all_data.predict(X_test)))
from imblearn.under_sampling import RandomUnderSampler
ros = RandomUnderSampler( random_state=1)
df_balanced, balanced_labels = ros.fit_resample(x, y)
####TRAINING AND PREDICTING CLASSIFIER BASED ON BALANCED DATASET
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(df_balanced, balanced_labels, train_size = 0.70)
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
# rfc=RandomForestClassifier()
# param_grid = {
# 'n_estimators': [50,100,200,500],
# 'max_features': ['auto', 'sqrt', 'log2'],
# 'criterion' :['gini', 'entropy']
# }
# CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 10)
# CV_rfc = MLPClassifier(verbose=True,hidden_layer_sizes=(100,50,10),learning_rate='invscaling',learning_rate_init=0.0003, n_iter_no_change=50, max_iter=100)
CV_rfc = DecisionTreeClassifier()
CV_rfc.fit(X_train, y_train)
# CV_rfc.best_params_
Questions:
Given the visualization:
Upvotes: 1
Views: 91
Reputation: 1547
You can try setting the class_weight="balanced" argument on the models. It is supported by many scikit-learn classifiers (for example RandomForestClassifier, DecisionTreeClassifier and LinearSVC, though not GaussianNB or MLPClassifier). It won't be magic, but in my experience it usually works better than under- or over-sampling.
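As a rough illustration of the class_weight suggestion, here is a minimal sketch. The synthetic data from make_classification is only a stand-in for the asker's two-feature dataset (the 90/10 class ratio is an assumption), and the stratified split is my own choice so that the test set keeps the original class ratio:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 2 features, ~90% class 0.
X, y = make_classification(
    n_samples=2000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)

# Stratify so train and test keep the same (imbalanced) class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency,
# so no rows have to be thrown away as with under-sampling.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
```

Because nothing is resampled, the report is computed on a test set with the real class proportions, which makes it comparable to the results on the original dataset.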
For the metric used in your grid search, I would use the F1 score (scoring="f1" instead of the default accuracy), as suggested by @Erwan. It heavily penalizes both poor precision and poor recall, and rewards hyperparameters that yield a more balanced model.
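A small sketch of what switching the grid-search metric could look like. The data is again a synthetic stand-in (assumption), and the grid is deliberately smaller than the one in the question to keep the run short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption): 2 features, ~90% class 0.
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

# Reduced, illustrative grid (not the asker's full grid).
param_grid = {
    "n_estimators": [50, 100],
    "criterion": ["gini", "entropy"],
}

# scoring="f1" ranks candidates by the F1 score of the positive class (1)
# instead of the default accuracy.
search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid=param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
test_f1 = f1_score(y_test, search.predict(X_test))
print(test_f1)
```

If both classes matter equally, scoring="f1_macro" is another built-in option that averages the per-class F1 scores.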
Upvotes: 0