Reputation: 79
I have a working classifier with a dataset split in a train set (70%) and a test set (30%).
However, I'd like to implement a validation set as well (so that: 70% train, 20% validation and 10% test). The sets should be randomly chosen and the results should be averaged over 10 different assignments.
Any ideas how to do this? Below is my implementation using a train and test set only:
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

def classifier(samples):
    # load the dataset
    dataset = samples
    # hold out 30% of the data as the test set
    data_train, data_test, target_train, target_test = train_test_split(
        dataset["data"], dataset["target"], test_size=0.30, random_state=42)
    # fit a k-nearest neighbor model to the training data
    model = KNeighborsClassifier()
    model.fit(data_train, target_train)
    print(model)
    # make predictions on the test set
    expected = target_test
    predicted = model.predict(data_test)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
Upvotes: 2
Views: 5058
Reputation: 76297
For what you're describing, you just need to use train_test_split, followed by a second split on its results.
Adapting the scikit-learn cross-validation tutorial, start with something like this:
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape   # ((150, 4), (150,))
Then, just like there, make the initial train/test partition:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.1, random_state=0)
Now you just need to split the remaining 90% (the train part) into two further parts:
# test_size=0.2/0.9 makes the validation set 20% of the full dataset
X_train_final, X_val, y_train_final, y_val = \
    cross_validation.train_test_split(X_train, y_train, test_size=0.2/0.9)
If you want 10 random train/validation splits, repeat the last line 10 times (the resulting splits will overlap).
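For the "averaged over 10 different assignments" part of the question, one option is to wrap that last split in a loop with a different seed each time and average a validation score. A minimal sketch, assuming a KNeighborsClassifier as in the question and the newer sklearn.model_selection module (which replaced sklearn.cross_validation in later scikit-learn releases):

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# fixed 10% test split, as above; the test set is never touched in the loop
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=0)

scores = []
for seed in range(10):
    # repeat the train/validation split with a different random seed each time
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2/0.9, random_state=seed)
    model = KNeighborsClassifier()
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_val, y_val))

print(np.mean(scores), np.std(scores))

The mean and standard deviation over the ten splits give a rough idea of how sensitive the validation score is to the particular assignment.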
Alternatively, you could replace the last line with 10-fold cross-validation (see the relevant scikit-learn classes).
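For that 10-fold alternative, cross_val_score (also in sklearn.model_selection in current releases) does the folding, fitting and scoring in one call; a minimal sketch, again with a KNeighborsClassifier standing in for whatever estimator you use:

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=0)

# 10-fold cross-validation on the train part only; the 10% test set stays untouched
cv_scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=10)
print(cv_scores.mean(), cv_scores.std())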
The main point is to build the CV sets from the train part of the initial train/test partition.
Upvotes: 2
Reputation: 1404
For k-fold cross-validation (note that this is not the same k as in your kNN classifier), divide your training set into k sections; say 5 as a starting point. You then build 5 models on your training data, each trained on 4 of the sections and tested against the remaining one, so every data point in your training set ends up used for both training and testing. Wikipedia has a much more detailed description of cross-validation than I've given here.
You can then evaluate against your validation set, adjust as necessary, and finally check against your held-out test set.
scikit-learn has a well-documented method for this.
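A minimal sketch of that procedure, assuming scikit-learn's KFold class (sklearn.model_selection in current releases), k=5, and a KNeighborsClassifier as in the question:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, KFold
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

fold_scores = []
# 5 folds: each model trains on 4/5 of the training data and is scored on the remaining 1/5
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = KNeighborsClassifier()
    model.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))

print(np.mean(fold_scores))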
Upvotes: 1