user5070125

Reputation:

Performing K-fold Cross-Validation: Using Same Training Set vs. Separate Validation Set

I am using the Python scikit-learn framework to build a decision tree. I am currently splitting my training data into two separate sets, one for training and the other for validation (implemented via K-fold cross-validation).

To cross-validate my model, should I split my data into two sets as outlined above or simply use the full training set? My main objective is to prevent overfitting. I have seen conflicting answers online about the use and efficacy of both these approaches.

I understand that K-fold cross-validation is commonly used when there is not enough data for a separate validation set. I do not have this limitation. Intuitively speaking, I believe that employing K-fold cross-validation in conjunction with a separate dataset will further reduce overfitting.

Is my supposition correct? Is there a better approach I can use to validate my model?

Split Dataset Approach:

x_train, x_test, y_train, y_test = train_test_split(df[features], df["SeriousDlqin2yrs"], test_size=0.2, random_state=13)

dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(x_train, y_train)

scores = cross_val_score(dt, x_test, y_test, cv=10)

Training Dataset Approach:

x_train=df[features]
y_train=df["SeriousDlqin2yrs"]

dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(x_train, y_train)

scores = cross_val_score(dt, x_train, y_train, cv=10)

Upvotes: 1

Views: 1521

Answers (1)

lejlot

Reputation: 66815

OK, it seems you are confused both about validation and about what cross_val_score actually does. First things first: you should not use either of the above approaches. If you are not searching for hyperparameters, but simply want to answer the question "How good is a decision tree with min_samples_split=20 on my data?", then you should do:

dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
scores = cross_val_score(dt, X, y, cv=10)  # here X = df[features], y = df["SeriousDlqin2yrs"]
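
The resulting scores is just an array of 10 per-fold accuracies, which you would usually summarize by its mean and standard deviation, for example:

import numpy as np

print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))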

Notice there is no manual splitting here. Why? Because cross_val_score does the splitting itself: it splits X and y into 10 parts and, 10 times, fits on the training portion and tests on the remaining part. In other words, if you do something like

x_train=df[features]
y_train=df["SeriousDlqin2yrs"]

dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(x_train, y_train) # this line does nothing!

scores = cross_val_score(dt, x_train, y_train, cv=10)

Then the fit call is useless, as cross_val_score clones the estimator and calls fit again itself, 10 times. Furthermore, this approach never uses a held-out test set at all! Your split-dataset approach is incorrect for a similar reason: the fit on x_train is simply ignored, and cross_val_score then both fits and tests inside the small test set.
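
To make that concrete, cross_val_score does roughly the following under the hood. This is only a sketch using KFold (for classifiers cross_val_score actually defaults to stratified folds and works on a clone of the estimator, but the idea is the same); X is assumed to be a DataFrame and y a Series, hence the .iloc indexing:

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

kf = KFold(n_splits=10)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
    dt.fit(X.iloc[train_idx], y.iloc[train_idx])                       # fit on 9 folds
    fold_scores.append(dt.score(X.iloc[test_idx], y.iloc[test_idx]))   # test on the held-out fold

Any fit you performed before this loop is simply discarded.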

However, if you are trying to tune a hyperparameter, say this min_samples_split, then you should first hold out a test set and cross-validate only on the training part (assuming your test set is big enough to be representative):

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

scores = []
for param in [10, 20, 40]:
    dt = DecisionTreeClassifier(min_samples_split=param, random_state=99)
    scores.append((cross_val_score(dt, X_train, y_train, cv=10).mean(), param))

best_param = max(scores)[1]  # parameter with the highest mean CV accuracy
dt = DecisionTreeClassifier(min_samples_split=best_param, random_state=99)
dt.fit(X_train, y_train)  # refit on the full training set with the chosen parameter
print(np.mean(dt.predict(X_test) == y_test))  # checking accuracy on the held-out test set
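
As a side note, scikit-learn's GridSearchCV wraps exactly this loop: it cross-validates every candidate value on the training set, refits the best one, and can then be scored on the held-out test set. A minimal sketch, reusing the X_train/X_test split from above:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

grid = GridSearchCV(DecisionTreeClassifier(random_state=99),
                    param_grid={"min_samples_split": [10, 20, 40]},
                    cv=10)
grid.fit(X_train, y_train)         # cross-validates each candidate on the training set
print(grid.best_params_)           # the winning value of min_samples_split
print(grid.score(X_test, y_test))  # best model, refitted on all of X_train, scored on the test set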

Upvotes: 1
