Reputation: 992
I'm working on the mushroom classification data set (found here: https://www.kaggle.com/uciml/mushroom-classification).
I've done some pre-processing on the data (removed redundant attributes, changed categorical data to numerical) and I'm trying to use my data to train classifiers.
Whenever I shuffle my data, either manually or by using train_test_split, all of the models I use (XGB, MLP, LinearSVC, Decision Tree) reach 100% accuracy. Whenever I test the models on unshuffled data, the accuracy is around 50-85%.
Here are my methods for splitting the data:
from sklearn.model_selection import train_test_split
x = testing.copy()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, shuffle=True)
and manually:
x = testing.copy()
x = x.sample(frac=1)
testRatio = 0.3
testCount = int(len(x)*testRatio)
x_train = x[testCount:]
x_test = x[0:testCount]
y_train = y[testCount:]
y_test = y[0:testCount]
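One thing worth checking in the manual version: as far as the snippet shows, only x is reordered by sample(frac=1), while y is sliced in its original order, so rows and labels may no longer line up. Below is a minimal sketch of a manual shuffle that keeps them aligned. The tiny DataFrame, the column name "feature" and the random_state are made up for illustration and are not from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 10 rows, label derived from the feature
x = pd.DataFrame({"feature": np.arange(10)})
y = pd.Series(np.arange(10) % 2, name="class")

# Shuffle the features, then reorder the labels using the same index
x_shuffled = x.sample(frac=1, random_state=0)
y_shuffled = y.loc[x_shuffled.index]

test_ratio = 0.3
test_count = int(len(x_shuffled) * test_ratio)

# Positional slicing on both objects, so row i of x matches row i of y
x_test, x_train = x_shuffled.iloc[:test_count], x_shuffled.iloc[test_count:]
y_test, y_train = y_shuffled.iloc[:test_count], y_shuffled.iloc[test_count:]
```

Reindexing y with x_shuffled.index is what ties each label to its row; slicing both with .iloc afterwards keeps that pairing.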
Is there something I'm doing completely wrong or missing?
Edit: The only difference that I can see when splitting data with and without shuffling the rows is the distribution of the classes.
Without shuffling:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, shuffle=False)
print(y_test.value_counts())
print(y_train.value_counts())
Results in:
0 1828
1 610
Name: class, dtype: int64
1 3598
0 2088
Name: class, dtype: int64
While shuffling:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, shuffle=True)
print(y_test.value_counts())
print(y_train.value_counts())
Results in:
0 1238
1 1200
Name: class, dtype: int64
1 3008
0 2678
Name: class, dtype: int64
I don't see how this would affect the model's accuracy in such a big way though.
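The class counts above suggest the rows may be ordered in some way, so an unshuffled split can end up with a different class mix in the test set than in the training set. The sketch below illustrates this on made-up data (60 rows of class 0 followed by 40 of class 1, which is an assumption for the example, not the mushroom data) and shows that passing stratify=y to train_test_split keeps the class ratio the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data sorted by class: 60 rows of class 0, then 40 of class 1
y = pd.Series([0] * 60 + [1] * 40, name="class")
x = pd.DataFrame({"feature": np.arange(100)})

# Without shuffling, the test set is simply the last 30% of the rows
_, _, _, y_test_ordered = train_test_split(x, y, test_size=0.3, shuffle=False)

# With stratify=y, the split is shuffled and keeps the 60/40 class ratio
_, _, _, y_test_strat = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0
)

print(y_test_ordered.value_counts())
print(y_test_strat.value_counts())
```

This only demonstrates the effect of ordering on the class mix; whether it fully explains the accuracy gap on the real data would still need to be checked.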
Edit2: Following PV8's advice, I've tried verifying my results using cross-validation, and it seems to do the trick: I'm getting much more reasonable results this way.
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
model = LinearSVC()
scores = cross_val_score(model, x, y, cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Output:
[1. 1. 1. 1. 0.75246305]
Accuracy: 0.95 (+/- 0.20)
Upvotes: 1
Views: 390
Reputation: 6260
This can be normal behavior. How many shuffles did you try?
This indicates that your results are quite sensitive to the way you split the data. I hope you measured the test accuracy and not the training accuracy?
I would suggest using cross-validation; this will help you verify your general results.
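As a rough sketch of what that could look like: passing an integer cv to cross_val_score for a classifier uses stratified folds without shuffling, so if the rows are ordered, individual folds can still differ a lot. Passing a StratifiedKFold with shuffle=True shuffles the rows before the folds are built. The data, the DecisionTreeClassifier and the random_state values below are placeholders chosen for the example, not the asker's setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 200 rows, 4 integer-coded features,
# label depends only on the first feature
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 4))
y = (X[:, 0] >= 2).astype(int)

# Shuffle before building the folds, keeping class ratios equal across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

If the per-fold scores are close to each other with shuffled folds but not with unshuffled ones, that points to the row order rather than the model as the source of the variation.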
Upvotes: 1