Reputation: 115
So I'm training a Gaussian Naive Bayes classifier and for some reason I am getting a perfect score on every metric in the classification report. Obviously something is wrong, and I was wondering why this is happening. I don't have much experience in the field, so any help is appreciated! Kaggle Kernel link: https://www.kaggle.com/rafayk7/kickstarter-real
features_train, features_test, target_train, target_test = train_test_split(
data_analyze_scaled,
target,
test_size = 0.2,
random_state=42
)
print(features_train.shape)
print(target_train.shape)
print(features_test.shape)
print(target_test.shape)
Gives
(265337, 254)
(265337,)
(66335, 254)
(66335,)
And then when I train it,
model = GaussianNB()
pred = model.fit(features_train, target_train).predict(features_test)
report = classification_report(target_test, pred)
print(report)
This gives me 1.0 on everything (precision, recall, accuracy, F1). Even a logistic regression model gives me 100% accuracy. I don't think this is overfitting, because it is a flat 100%. Any help is appreciated!
Here is a snapshot of the data:
target = data_analyze_scaled['state']
data_analyze_scaled.drop('state', axis=1)
These are the target and the data_analyze_scaled that I use in my train_test_split.
Upvotes: 3
Views: 141
Reputation: 4264
The error is in data_analyze_scaled.drop('state', axis=1).
By default, drop does not modify the frame in place: it returns a new data frame with the column state removed, and that returned frame has to be saved in another variable, like:
data_analyze_scaled_x = data_analyze_scaled.drop('state', axis=1)
Now use data_analyze_scaled_x in your train test split.
In your existing implementation the target variable is still present as a feature, so the model can simply read the answer off that column and the accuracy will be 1.0 whichever model you use.
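A minimal sketch of the leak, using a tiny synthetic frame standing in for data_analyze_scaled (the column names besides state are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for data_analyze_scaled: two features plus the target.
df = pd.DataFrame({'goal': [1.0, 2.0], 'backers': [3.0, 4.0], 'state': [0, 1]})

# drop() returns a NEW frame by default (inplace=False); df is unchanged,
# so 'state' is still among the features fed to the model.
df.drop('state', axis=1)
print('state' in df.columns)        # True -- the target leaked into the features

# Keeping the returned copy actually removes the target column.
features = df.drop('state', axis=1)
print('state' in features.columns)  # False
```

With the leaked column present, any classifier can map the state feature directly to the state label, which is why every model reports a flat 100%.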
Upvotes: 3