Reputation: 115
So I'm training a Gaussian Naive Bayes classifier and for some reason I am getting a perfect score on every metric in the classification report. Obviously something is wrong, and I was wondering why this is happening. I don't have much experience in the field, so any help is appreciated! Kaggle Kernel link: https://www.kaggle.com/rafayk7/kickstarter-real
features_train, features_test, target_train, target_test = train_test_split(
data_analyze_scaled,
target,
test_size = 0.2,
random_state=42
)
print(features_train.shape)
print(target_train.shape)
print(features_test.shape)
print(target_test.shape)
Gives
(265337, 254)
(265337,)
(66335, 254)
(66335,)
And then when I train it,
model = GaussianNB()
pred = model.fit(features_train, target_train).predict(features_test)
report = classification_report(target_test, pred)
print(report)
This gives me 1.0 on everything (precision, recall, accuracy, F1). Even a logistic regression model gives me 100% accuracy. I don't think this is overfitting, because it is a flat 100%. Any help is appreciated!
Here is a snapshot of the data:
target = data_analyze_scaled['state']
data_analyze_scaled.drop('state', axis=1)
These are the target and the data_analyze_scaled that I use in my train_test_split.
Upvotes: 3
Views: 141
Reputation: 4264
The error is in data_analyze_scaled.drop('state', axis=1).
By default, drop does not modify the frame in place: it returns a new data frame with the column state removed, and that returned frame has to be saved in another variable, like:
data_analyze_scaled_x = data_analyze_scaled.drop('state', axis=1)
Now use data_analyze_scaled_x in your train test split.
In your existing implementation the target variable is still present as a feature, so the model can simply read the answer off that column and the accuracy will be 1.0 whichever model you use.
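A minimal sketch of the leak, using a tiny synthetic frame standing in for data_analyze_scaled (the column names besides state are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for data_analyze_scaled: two features plus the target.
df = pd.DataFrame({'goal': [1.0, 2.0], 'backers': [3.0, 4.0], 'state': [0, 1]})

# drop() returns a NEW frame by default (inplace=False); df is unchanged,
# so 'state' is still among the features fed to the model.
df.drop('state', axis=1)
print('state' in df.columns)        # True -- the target leaked into the features

# Keeping the returned copy actually removes the target column.
features = df.drop('state', axis=1)
print('state' in features.columns)  # False
```

With the leaked column present, any classifier can map the state feature directly to the state label, which is why every model reports a flat 100%.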
Upvotes: 3