Reputation: 13
I am getting 100% accuracy on my test set when training a random forest classifier.
Is there something wrong with my model or code?
Code
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

ds = pd.read_csv('census-income.test(no unk.).csv')
df = pd.read_csv('census-income.data(no unk.).csv')
X = df
y = df['income']
X_T = ds
y_T = ds['income']
categorical_preprocessor = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("categorical", categorical_preprocessor,
     ['workclass', 'education', 'martial-status', 'occupation', 'relationship',
      'race', 'sex', 'native-country', 'income']),
], remainder='passthrough')
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=128, max_depth=7)),
])
X_train = X
X_test = X_T
y_train = y
y_test = y_T
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
print(confusion_matrix(y_test, y_pred))
Training data: [image: sample rows of the training data]
Test data: [image: sample rows of the test data]
Confusion matrix
[[11360 0]
[ 0 3700]]
Upvotes: 1
Views: 65
Reputation: 4273
The label leaked into your features: both the training set and the test set still contain the target column.
Doing this:
df = pd.read_csv('census-income.data(no unk.).csv')
X = df
y = df['income']
... means that X still contains the income column. The model achieves perfect predictions because "income" in X is identical to the target "income" in y, so the forest simply reads the answer back out of the features.
You need to drop the target from the features, for both the training and the test data:
X = df.drop(['income'], axis=1)
y = df['income']
X_T = ds.drop(['income'], axis=1)
y_T = ds['income']
Also remove 'income' from the column list passed to the ColumnTransformer, since that column no longer exists in X after the drop.
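To see the effect in isolation, here is a minimal sketch on synthetic data (hypothetical column names, not your census CSVs): training with the label left inside X yields perfect test accuracy, while dropping it first gives a realistic score.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "feat1": rng.normal(size=n),
    "feat2": rng.normal(size=n),
})
# Label is only weakly related to the features.
df["income"] = (df["feat1"] + rng.normal(scale=2.0, size=n) > 0).astype(int)
y = df["income"]

# Leaky setup: the label column stays inside X.
X_leaky = df
Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, random_state=0)
leaky_acc = accuracy_score(
    yte, RandomForestClassifier(random_state=0).fit(Xtr, ytr).predict(Xte))

# Correct setup: drop the label before training.
X_clean = df.drop(columns=["income"])
Xtr, Xte, ytr, yte = train_test_split(X_clean, y, random_state=0)
clean_acc = accuracy_score(
    yte, RandomForestClassifier(random_state=0).fit(Xtr, ytr).predict(Xte))

print(leaky_acc)  # perfect score: the answer is sitting in the features
print(clean_acc)  # a realistic score, below 1.0
```

The forest in the leaky setup just splits on the income column itself, which is exactly what happened with your pipeline.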
Upvotes: 2