DaPanda

Reputation: 319

Why does shuffling training data affect my random forest classifier's accuracy?

A similar question has been asked before, but since that OP didn't post the code, not much helpful information was given.

I'm having essentially the same problem: for some reason, shuffling the data gives a big accuracy gain (from 45% to 94%!) for my random forest classifier. (In my case removing duplicates also affected the accuracy, but that may be a discussion for another day.) Based on my understanding of how the RF algorithm works, this really should not happen.

My data are merged from several files, each containing the same samples in the same order. For each sample, the first 3 columns are separate outputs, but currently I'm just focusing on the first output.

The merged data looks like this; the output (1st column) is ordered and unevenly distributed: [screenshot of the merged data]

The shuffled data looks like this: [screenshot of the shuffled data]

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

TOTAL_OUTPUTS = 3

... (code for merging data and feature engineering)

# Four variants of the same dataset: as merged, shuffled,
# de-duplicated, and de-duplicated then shuffled.
# sample(frac=1.0) returns all rows in random order, i.e. a shuffle.
to_compare = {
    "merged": merged,
    "merged shuffled": merged.sample(frac=1.0),
    "merged distinct": merged.drop_duplicates(),
    "merged distinct shuffled": merged.drop_duplicates().sample(frac=1.0)
}


params = {'n_estimators': 300,
          'max_depth': 15,
          'criterion': 'entropy',
          'max_features': 'sqrt'
          }

for name, data_to_compare in to_compare.items():
    features = data_to_compare.iloc[:, TOTAL_OUTPUTS:]  # columns after the 3 outputs
    y = data_to_compare.iloc[:, 0]                      # first output only
    rf = RandomForestClassifier(**params)
    scores = cross_val_score(rf, features, y, cv=3)
    print(name, scores.mean(), np.std(scores))

Output:

merged 0.44977727094363956 0.04442305341799508
merged shuffled 0.9431099584137672 0.0008679933736473513
merged distinct 0.44780773420479303 0.04365860091028133
merged distinct shuffled 0.8486519607843137 0.00042583049485598673
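For what it's worth, the gap is reproducible on synthetic data. Here is a minimal sketch (my own toy construction, not the actual merged files), assuming the real data has some drift in feature values along the row order, e.g. because it was concatenated from several files:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Labels sorted in contiguous blocks, like the merged data above.
n_per_class, n_classes = 200, 3
y = np.repeat(np.arange(n_classes), n_per_class)

# Within each class block the feature level jumps halfway through,
# mimicking rows concatenated from files with slightly different regimes.
drift = np.tile(
    np.r_[np.zeros(n_per_class // 2), np.full(n_per_class // 2, 10.0)],
    n_classes,
)
X = (y + drift + rng.normal(scale=0.2, size=y.size)).reshape(-1, 1)

rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Contiguous folds: train and test see different feature regimes.
plain = cross_val_score(rf, X, y, cv=StratifiedKFold(n_splits=2)).mean()
# Shuffled folds: both regimes appear on each side of every split.
mixed = cross_val_score(
    rf, X, y, cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
).mean()
print(f"contiguous folds: {plain:.2f}  shuffled folds: {mixed:.2f}")
```

Note that instead of shuffling the DataFrame itself, the same effect can be had by passing a shuffling splitter, e.g. `cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0)`, to `cross_val_score`.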

Upvotes: 1

Views: 2380

Answers (1)

Bluexm

Reputation: 157

The unshuffled data you are using shows that the values of certain features tend to be constant over runs of consecutive rows. This weakens the forest because all of the individual trees composing it are weaker.

To see why, take the reasoning to an extreme: if one of the features is constant over the whole dataset (or over the chunk of the dataset a tree is trained on), then splitting on that feature produces no change in entropy. That feature is therefore never selected, and the tree underfits.
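This is easy to check in code. A toy sketch (my own construction, not the asker's data): a decision tree given a column that is constant, as a feature can be within one contiguous chunk of unshuffled data, never splits on it, so its importance is exactly zero:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200

informative = rng.normal(size=n)
y = (informative > 0).astype(int)

# Column 0 is constant (zero entropy change on any split);
# column 1 is a noisy copy of the signal.
X = np.column_stack([np.zeros(n), informative + rng.normal(scale=0.5, size=n)])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.feature_importances_)  # the constant column gets importance 0.0
```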

Upvotes: 1
