Alexander

Reputation: 4655

Feature importance never improves above ~0.1 in RandomForest

I have an imbalanced dataset, and I applied RandomOverSampler to get a balanced dataset.

from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X, y)
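As a quick sanity check (assuming `y` holds the original labels), the class counts before and after resampling can be compared:

from collections import Counter

print(Counter(y))       # original, imbalanced class counts
print(Counter(y_over))  # after oversampling the classes should be equal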

Afterwards, I followed the RandomForest feature-selection implementation from this Kaggle post:

https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial (scroll to the bottom of the page for the implementation).

I have a real dataset similar to the Titanic one :) and I'm trying to get feature importances out of it.

The problem I'm having is that even though the classifier accuracy is very high (~0.99), every feature importance I get is only on the order of ~0.1. What could be causing that, or is this OK?


Here is the code I'm using, similar to the one at the bottom of the linked page.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

SEED = 42
N = 15

classifiers = [RandomForestClassifier(random_state=SEED,
                                      criterion='gini',
                                      n_estimators=20,
                                      bootstrap=True,
                                      max_depth=5,
                                      n_jobs=-1)]

              #DecisionTreeClassifier(),
              #LogisticRegression(),
              #KNeighborsClassifier(),
              #GradientBoostingClassifier(),
              #SVC(probability=True), GaussianNB()]

log_cols = ["Classifier", "Accuracy"]
log      = pd.DataFrame(columns=log_cols)

# shuffle with a fixed seed so the folds are reproducible
skf = StratifiedKFold(n_splits=N, shuffle=True, random_state=SEED)

importances = pd.DataFrame(np.zeros((X.shape[1], N)),
                           columns=['Fold_{}'.format(i) for i in range(1, N + 1)],
                           index=data.columns)


acc_dict = {}

for fold, (train_index, test_index) in enumerate(skf.split(X_over, y_over)):
    X_train, X_test = X_over[train_index], X_over[test_index]
    y_train, y_test = y_over[train_index], y_over[test_index]

    for clf in classifiers:
        name = clf.__class__.__name__
        clf.fit(X_train, y_train)
        test_predictions = clf.predict(X_test)
        acc = accuracy_score(y_test, test_predictions)

        if 'Random' in name:
            # enumerate() starts at 0, so index with `fold` directly;
            # the original `fold - 1` wrote the first fold into the last column
            importances.iloc[:, fold] = clf.feature_importances_

        if name in acc_dict:
            acc_dict[name] += acc
        else:
            acc_dict[name] = acc

        #doing grid search for best input parameters for RF
        #CV_rfc = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
        #CV_rfc.fit(X_train, y_train)

for clf in acc_dict:
    acc_dict[clf] = acc_dict[clf] / N  # average over all N folds, not 10
    log_entry = pd.DataFrame([[clf, acc_dict[clf]]], columns=log_cols)
    log = pd.concat([log, log_entry], ignore_index=True)  # DataFrame.append is deprecated
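For completeness, here is a sketch of how the per-fold importances could then be aggregated, following the pattern in the linked Kaggle notebook (the `Mean_Importance` column name is an assumption):

importances['Mean_Importance'] = importances.mean(axis=1)
importances.sort_values(by='Mean_Importance', ascending=False, inplace=True)
print(importances['Mean_Importance'])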

I'm getting almost the same feature importance for every feature; the best is only ~0.1.


Following the confusion-matrix check suggested by @AlexSerraMarrugat:

EDIT

Test: 0.9926166568222091, Train: 0.9999704661911724

EDIT2

I tried RandomOverSampler again afterwards, this time on the training split only:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')
x_over, y_over = oversample.fit_resample(X_train, Y_train)
# summarize the class distribution after oversampling
print(Counter(y_over))
print(len(x_over))

# Creating the confusion matrix

from sklearn.metrics import plot_confusion_matrix

clf = RandomForestClassifier(random_state=0)  # change the hyperparameters here
clf.fit(x_over, y_over)
plot_confusion_matrix(clf, x_test, y_test, cmap=plt.cm.Blues)
print("Test: ", clf.score(x_test, y_test))
print("Train: ", clf.score(x_over, y_over))

Test: 0.9926757235676315 Train: 1.0


EDIT3: Confusion matrix for the training data

from sklearn.metrics import plot_confusion_matrix

plot_confusion_matrix(clf, X_train, Y_train, cmap=plt.cm.Blues)
print("Train: ", clf.score(X_train, Y_train))


Upvotes: 0

Views: 426

Answers (1)

Alex Serra Marrugat

Reputation: 2042

First of all, as Gaussian Prior said, you should oversample only your training dataset. Then, once the model is trained, evaluate the accuracy on your untouched test dataset.
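A minimal sketch of that order, assuming `X` and `y` are your full feature matrix and labels: split first, then oversample only the training portion.

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Split first so the test set is never touched by resampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample the minority class in the training data only
X_train_over, y_train_over = RandomOverSampler(
    sampling_strategy='minority').fit_resample(X_train, y_train)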

If I have understood you correctly, you are now getting ~0.1 with your test data. Please check whether you are overfitting: if the accuracy on the training dataset is much higher than on the test dataset, it probably indicates overfitting. Try changing some hyperparameters. Use this code:

clf = RandomForestClassifier(random_state=0) #Here change the hyperparameters
clf.fit(X_train, y_train)
predict_y=clf.predict(X_test)
plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues)
print("Test: ", clf.score(X_test, y_test))
print("Train: ", clf.score(X_train, y_train))

About feature importance: I suspect your results are correct. They are saying that you have 5 features that are the most important for your model. In my opinion, this is one of the better outcomes, where a few features carry most of the importance.

You would only obtain a single big value if there were one uniquely important feature (the model obtains its information from only one feature, which is not good at all).
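One detail worth knowing: scikit-learn normalizes impurity-based importances so they sum to 1 across all features, so with around ten features an even spread works out to roughly 0.1 each. A quick check, assuming `clf` is the fitted forest from above:

import numpy as np

print(clf.feature_importances_.sum())   # ~1.0 by construction
order = np.argsort(clf.feature_importances_)[::-1]
print(order[:5], clf.feature_importances_[order[:5]])  # the five dominant features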

Upvotes: 1
