Reputation: 1246
So I have finally completed my first machine learning model in Python. Initially I take a data set and split it like so:
from sklearn import model_selection

# Split out a validation dataset
array = dataset.values
X = array[:, 2:242]  # feature columns
Y = array[:, 1]      # target column
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
And so you can see I'm going to use 20% of my data for validation. But once the model is built, I would like to validate/test it with data that it has never touched before. Do I simply build the same X, Y arrays and set validation_size = 1? I'm stuck on how to test it without retraining it.
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

scoring = 'accuracy'  # metric used by cross_val_score

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#models.append(('SVM', SVC()))

# evaluate each model in turn via 12-fold cross-validation on the training set
results = []
names = []
for name, model in models:
    # shuffle=True is required for random_state to have an effect
    kfold = model_selection.KFold(n_splits=12, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# fit the chosen model on the training set and score it on the validation set
lr = LogisticRegression()
lr.fit(X_train, Y_train)
predictions = lr.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
I can run data through the model and get a prediction back, but how do I test this on 'new' historical data?
I can do something like this to predict: lr.predict([[5.7,...,2.5]])
but I'm not sure how to pass a whole test data set through and get a confusion_matrix / classification_report.
Upvotes: 1
Views: 4398
Reputation: 61
[question]: I can run data through the model and get a prediction back, but how do I test this on 'new' historical data?
If you check out my project below, you can see how I trained and tested my data. I personally would never test on all of my data. https://github.com/wendysegura/Portland_Forecasting/blob/master/CSV_Police_Files/Random%20Forest%202012-2016.ipynb
Below is the general form for sklearn model classes and methods.
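A minimal sketch of that pattern (using RandomForestClassifier to match the linked notebook's title; the variable names here are placeholders, not taken from the notebook):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hold out 20% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)

model = RandomForestClassifier()      # any sklearn estimator follows this form
model.fit(X_train, y_train)           # learn from the training split only
predictions = model.predict(X_test)   # predict on unseen data
print(accuracy_score(y_test, predictions))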
Upvotes: 2
Reputation: 2776
But once the model is built, I would like to validate/test it with data that it has never touched before.
The reason you split your data into train and test (validation) sets is to evaluate the model on data that did not participate in the training set. So your model should never use your test set for learning; don't touch it until evaluation.
Sometimes, if you want to compare against another test set, you can extract two test sets with the same method, for example (50%, 25%, 25%) or (70%, 15%, 15%), depending on the distribution of your data.
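A minimal sketch of the 70/15/15 case, done with two calls to sklearn's train_test_split (the ratios are just the example above; X and Y are your feature and target arrays):

from sklearn.model_selection import train_test_split

# First cut: keep 70% for training, hold out 30%
X_train, X_hold, Y_train, Y_hold = train_test_split(X, Y, test_size=0.30, random_state=7)
# Second cut: split the held-out 30% in half -> 15% validation, 15% test
X_val, X_test, Y_val, Y_test = train_test_split(X_hold, Y_hold, test_size=0.50, random_state=7)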
I can run data through the model and get a prediction back, but how do I test this on 'new' historical data?
You use the predict method. But for genuinely 'new' data you don't have a validation dataset, because you can't know the true labels for data that hasn't happened yet. This is why machine learning works with probabilities, accuracies, and other metrics, which estimate how well the model would work on 'new' data.
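That said, 'new' historical data usually does come with known outcomes, so you can score it exactly like the validation set: slice the same columns, call predict, and pass the true and predicted labels to the metric functions. A minimal sketch, assuming a hypothetical new_data.csv with the same column layout as the training file:

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# new_data.csv is a placeholder name for the unseen historical data
new_array = pd.read_csv('new_data.csv').values
X_new = new_array[:, 2:242]  # same feature columns as in training
Y_new = new_array[:, 1]      # same target column as in training

predictions = lr.predict(X_new)  # no retraining, only prediction
print(accuracy_score(Y_new, predictions))
print(confusion_matrix(Y_new, predictions))
print(classification_report(Y_new, predictions))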
Upvotes: 0