user3486773

Reputation: 1246

How do I run test data through my Python Machine Learning Model?

So I have finally completed my first machine learning model in Python. Initially I take a data set and split it like such:

# Split out the validation dataset
from sklearn import model_selection

array = dataset.values
X = array[:, 2:242]  # feature columns
Y = array[:, 1]      # target column
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

And so you can see I'm going to use 20% of my data for validation. But once the model is built, I would like to validate/test it with data it has never touched before. Do I simply build the same X, Y arrays from the new data and set validation_size to 1? I'm stuck on how to test it without retraining it.

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

scoring = 'accuracy'

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#models.append(('SVM', SVC()))

# evaluate each model in turn with 12-fold cross-validation on the training set
results = []
names = []
for name, model in models:
    # shuffle=True is required for random_state to take effect
    kfold = model_selection.KFold(n_splits=12, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))


from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# fit the chosen model on the training set, then evaluate on the held-out validation set
lr = LogisticRegression()
lr.fit(X_train, Y_train)
predictions = lr.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

I can run data through the model and return a prediction, but how do I test this on 'new' historical data?

I can do something like this to predict: lr.predict([[5.7,...,2.5]])

but I'm not sure how to pass a test data set through and get a confusion_matrix / classification_report.
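
For example, if I had a second labeled CSV of historical data (the file name below is just a placeholder), I imagine something like this, though I'm not sure it's right:

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# hypothetical file with the same column layout as the original dataset
new_dataset = pd.read_csv('historical_holdout.csv')
new_array = new_dataset.values
X_new = new_array[:, 2:242]  # same feature columns as before
Y_new = new_array[:, 1]      # same target column as before

# reuse the already-fitted model; no retraining
new_predictions = lr.predict(X_new)
print(accuracy_score(Y_new, new_predictions))
print(confusion_matrix(Y_new, new_predictions))
print(classification_report(Y_new, new_predictions))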

Upvotes: 1

Views: 4398

Answers (2)

gwenevere05

Reputation: 61

I can run data through the model, and return a prediction, but how do I test this on 'new' historical data?

If you check out my project below, you can see how I have trained and tested my data. I personally would never test on all of my data. https://github.com/wendysegura/Portland_Forecasting/blob/master/CSV_Police_Files/Random%20Forest%202012-2016.ipynb

General form for sklearn model classes and methods (a minimal sketch follows the list):

  1. model = base_models.AnySKLearnObject()
    • create an instance of an estimator class
  2. model.fit(train_X, train_y)
    • train your model; also called “fitting your data”
  3. model.score(train_X, train_y)
    • score your model on the training data using the default scoring method (using the metrics module is recommended going forward)
  4. model.predict(test_X)
    • predict on your test data
  5. model.score(test_X, test_y)
    • score your model on your test data
  6. model.predict(new_X)
    • make predictions for a new set of data
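
A minimal sketch of that sequence, assuming a plain train/test split on the built-in iris dataset (RandomForestClassifier and all variable names here are stand-ins, not part of the answer above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=7)

model = RandomForestClassifier(random_state=7)  # 1. create an estimator instance
model.fit(train_X, train_y)                     # 2. train the model
print(model.score(train_X, train_y))            # 3. score on the training data
predictions = model.predict(test_X)             # 4. predict the test data
print(model.score(test_X, test_y))              # 5. score on the test data
new_X = test_X[:5]                               # 6. stand-in for genuinely new rows
print(model.predict(new_X))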

Upvotes: 2

egorlitvinenko

Reputation: 2776

But once the model is built, I would like to validate/test it with data that it has never touched before.

The reason you split your data into train and test (validation) sets is to evaluate the model on data that did not participate in training. Your model should never use the test set for learning; keep it untouched until evaluation.

Sometimes, if you want to compare against a second test set, you can extract two test sets with the same method, for example (50%, 25%, 25%) or (70%, 15%, 15%), depending on the size and distribution of your data.
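
Reusing X, Y, and seed from the question, a minimal sketch of a (70%, 15%, 15%) split done with two calls to train_test_split (variable names are placeholders):

from sklearn.model_selection import train_test_split

# first cut: 70% train, 30% held out
X_train, X_hold, Y_train, Y_hold = train_test_split(X, Y, test_size=0.30, random_state=seed)
# second cut: split the held-out 30% in half, giving 15% validation and 15% test
X_val, X_test, Y_val, Y_test = train_test_split(X_hold, Y_hold, test_size=0.50, random_state=seed)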

I can run data through the model, and return a prediction, but how do I test this on 'new' historical data?

You use the predict method. But for genuinely "new" data you have no labels, so you cannot build a validation set from it. That is why machine learning works with probabilities, accuracies, and other metrics: they estimate how well the model is likely to perform on "new" data.
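
To make the distinction concrete, here is a minimal sketch (lr, X_validation, and Y_validation come from the question; new_X stands for a hypothetical unlabeled feature matrix):

from sklearn.metrics import confusion_matrix, classification_report

# labeled hold-out data: predictions can be compared with known answers
val_predictions = lr.predict(X_validation)
print(confusion_matrix(Y_validation, val_predictions))
print(classification_report(Y_validation, val_predictions))

# genuinely new data: no labels exist, so predictions are all you get
print(lr.predict(new_X))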

Upvotes: 0
