ninesalt

Reputation: 4354

scikit-learn cross validation score in regression

I'm trying to build a regression model, validate and test it, and make sure it doesn't overfit the data. This is my code thus far:

from pandas import read_csv
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve
import numpy as np
import matplotlib.pyplot as plt

data = np.array(read_csv('timeseries_8_2.csv', index_col=0))

inputs = data[:, :8]
targets = data[:, 8:]

x_train, x_test, y_train, y_test = train_test_split(
    inputs, targets, test_size=0.1, random_state=2)

rate1 = 0.005
rate2 = 0.1

mlpr = MLPRegressor(hidden_layer_sizes=(12,10), max_iter=700, learning_rate_init=rate1)

# trained = mlpr.fit(x_train, y_train)  # should I fit before cross val?
# predicted = mlpr.predict(x_test)      

scores = cross_val_score(mlpr, inputs, targets, cv=5)
print(scores)

scores prints an array of 5 numbers, where the first number is usually around 0.91 and is always the largest in the array. I'm having a little trouble figuring out what to do with these numbers. If the first number is always the largest, does that mean the model scored the highest on the first cross-validation attempt, and the scores then decreased as it kept cross-validating?

Also, should I fit the model on the training data before I call the cross-validation function? I tried commenting the fit out and it gives more or less the same results.

Upvotes: 2

Views: 7056

Answers (1)

fuglede

Reputation: 18201

The cross-validation function performs the model fitting itself as part of the operation, so you gain nothing from fitting by hand beforehand. From the user guide:

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
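
To see why the manual fit is redundant, here is a rough sketch of what cross_val_score does for a regressor with cv=5, reusing the inputs, targets and mlpr from your question (simplified; the real implementation also handles scoring options, parallelism, etc.). Each fold gets a fresh clone of the estimator, so anything you fitted beforehand is simply thrown away:

from sklearn.base import clone
from sklearn.model_selection import KFold

fold_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(inputs):
    fold_model = clone(mlpr)  # fresh, unfitted copy of the estimator for this fold
    fold_model.fit(inputs[train_idx], targets[train_idx])
    # score() on a regressor returns R^2, which is also cross_val_score's default metric
    fold_scores.append(fold_model.score(inputs[test_idx], targets[test_idx]))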

And yes, the returned numbers reflect multiple runs:

Returns: Array of scores of the estimator for each run of the cross validation.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
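
So each entry is the score of one fold, not successive attempts on the same split. In practice you usually summarize them rather than reading them one by one, for example (taking scores to be the array returned by cross_val_score above):

print("R^2 per fold:", scores)
print("mean R^2: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))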

Finally, there is no reason to expect that the first result is the largest:

from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn.neural_network import MLPRegressor
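# Note: load_boston was removed from scikit-learn in version 1.2; on a newer
# version you would have to substitute a different regression dataset here.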
boston = datasets.load_boston()
est = MLPRegressor(hidden_layer_sizes=(120,100), max_iter=700, learning_rate_init=0.0001)
cross_val_score(est, boston.data, boston.target, cv=5)

# Output
array([-0.5611023 , -0.48681641, -0.23720267, -0.19525727, -4.23935449])
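
The splits themselves are whatever the cv argument specifies; with an integer it is a plain unshuffled KFold, so if your rows are ordered (your file looks like a time series) one fold can end up systematically easier than the rest. A sketch of passing an explicit splitter instead, again reusing mlpr, inputs and targets from your question:

from sklearn.model_selection import KFold, TimeSeriesSplit

# Shuffled folds break up any ordering in the rows
print(cross_val_score(mlpr, inputs, targets, cv=KFold(n_splits=5, shuffle=True, random_state=2)))

# For genuinely temporal data, TimeSeriesSplit avoids training on the future
print(cross_val_score(mlpr, inputs, targets, cv=TimeSeriesSplit(n_splits=5)))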

Upvotes: 2
