Reputation: 13
I am getting 100% accuracy on my test set when training a random forest classifier.
Is there something wrong with my model or code?
Code
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

ds = pd.read_csv('census-income.test(no unk.).csv')
df = pd.read_csv('census-income.data(no unk.).csv')
X = df
y = df['income']
X_T = ds
y_T = ds['income']
categorical_preprocessor = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("categorical", categorical_preprocessor,
     ['workclass', 'education', 'martial-status', 'occupation', 'relationship',
      'race', 'sex', 'native-country', 'income']),
], remainder='passthrough')
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=128, max_depth=7)),
])
X_train = X
X_test = X_T
y_train = y
y_test = y_T
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
print(confusion_matrix(y_test, y_pred))
Training data: [image: sample rows of the training data]
Test data: [image: sample rows of the test data]
Confusion matrix
[[11360 0]
[ 0 3700]]
Upvotes: 1
Views: 65
Reputation: 4273
The label leaked into your features: both the training set and the test set still contain the target column.
Doing this:
df = pd.read_csv('census-income.data(no unk.).csv')
X = df
y = df['income']
... means that X still contains the income column. The model achieves perfect predictions because "income" in X is identical to the target "income" in y, so the forest simply reads the answer back out of the features.
You need to drop the target from the features, for both the training and the test data:
X = df.drop(['income'], axis=1)
y = df['income']
X_T = ds.drop(['income'], axis=1)
y_T = ds['income']
Also remove 'income' from the column list passed to the ColumnTransformer, since that column no longer exists in X after the drop.
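To see the effect in isolation, here is a minimal sketch on synthetic data (hypothetical column names, not your census CSVs): training with the label left inside X yields perfect test accuracy, while dropping it first gives a realistic score.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "feat1": rng.normal(size=n),
    "feat2": rng.normal(size=n),
})
# Label is only weakly related to the features.
df["income"] = (df["feat1"] + rng.normal(scale=2.0, size=n) > 0).astype(int)
y = df["income"]

# Leaky setup: the label column stays inside X.
X_leaky = df
Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, random_state=0)
leaky_acc = accuracy_score(
    yte, RandomForestClassifier(random_state=0).fit(Xtr, ytr).predict(Xte))

# Correct setup: drop the label before training.
X_clean = df.drop(columns=["income"])
Xtr, Xte, ytr, yte = train_test_split(X_clean, y, random_state=0)
clean_acc = accuracy_score(
    yte, RandomForestClassifier(random_state=0).fit(Xtr, ytr).predict(Xte))

print(leaky_acc)  # perfect score: the answer is sitting in the features
print(clean_acc)  # a realistic score, below 1.0
```

The forest in the leaky setup just splits on the income column itself, which is exactly what happened with your pipeline.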
Upvotes: 2