Reputation: 29
I have a linear regression model, and my cost function is a Sum of Squares Error function. I've split my full dataset into three sets: training, validation, and test. I am not sure how to calculate the training error and the validation error (and the difference between the two).
Is the training error the Residual Sum of Squares error calculated using the training dataset?
An example of what I'm asking: So if I was doing this in Python, and let's say I had 90 data-points in the training data set, then is this the correct code for the training error?
y_predicted = f(X_train, theta) #predicted y-value at point x, where y_train is the actual y-value at x
training_error = 0
for i in range(90):
    out = y_predicted[i] - y_train[i]
    out = out*out
    training_error+=out
training_error = training_error/2
print('The training error for this regression model is:', training_error)
Upvotes: 2
Views: 16894
Reputation: 771
This is mentioned in a comment on the post, but you need to divide by the total number of samples to get a number that you can compare across the training, validation, and test sets, since they usually have different sizes.
With that one change, the code would be:
y_predicted = f(X_train, theta) #predicted y-value at point x, where y_train is the actual y-value at x
training_error = 0
for i in range(90):
    out = y_predicted[i] - y_train[i]
    out = out*out
    training_error+=out
#change 2 to 90
training_error = training_error/90
print('The training error for this regression model is:', training_error)
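The goal is to compare different subsets of the data using the same metric. The divide by 2 you had in there is fine as well (it is a common convention that simplifies the gradient), as long as you also divide by the number of samples and apply the same formula to every subset you compare. Keep in mind that the result is then half the mean squared error.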
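To illustrate why the division matters, here is a minimal sketch in plain Python. The helper names `sse` and `mse` and the toy numbers are made up for illustration and are not from the question. Every prediction is off by exactly 1 in both subsets, so the model is equally wrong on each, yet the raw sum of squares is twice as large on the bigger subset.

```python
def sse(y_true, y_pred):
    """Sum of squared residuals; grows with the number of samples."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Sum of squared residuals divided by the number of samples."""
    return sse(y_true, y_pred) / len(y_true)


# Hypothetical targets and predictions: each prediction is off by exactly 1.
y_train, pred_train = [1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]
y_val, pred_val = [1.0, 2.0], [2.0, 3.0]

train_sse = sse(y_train, pred_train)  # depends on the subset size
val_sse = sse(y_val, pred_val)
train_mse = mse(y_train, pred_train)  # comparable across subsets
val_mse = mse(y_val, pred_val)

# The difference asked about in the question: validation minus training error.
gap = val_mse - train_mse

print("SSE train/val:", train_sse, val_sse)
print("MSE train/val:", train_mse, val_mse)
print("gap:", gap)
```

In practice the validation error is often somewhat higher than the training error, and a large gap is usually read as a sign of overfitting.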
Another way to do this in Python is with the scikit-learn library, which already provides this function:
from sklearn.metrics import mean_squared_error
training_error = mean_squared_error(y_train,y_predicted)
Also, when making calculations like this it is generally better and faster to use vectorized array operations instead of a for loop. In the context of this question, 90 records is quite small, but when you start working with larger sample sizes you could try something like this using NumPy:
import numpy as np
training_error = np.mean(np.square(np.array(y_predicted)-np.array(y_train)))
All three approaches compute the same mean squared error, so the results should agree up to floating-point rounding.
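If you want to check that for yourself, a rough sketch like the following compares the loop against the NumPy one-liner. The values are hypothetical stand-ins for `y_train` and `f(X_train, theta)`. scikit-learn is left out to keep the dependencies minimal, but `mean_squared_error(y_train, y_predicted)` is expected to return the same number.

```python
import numpy as np

# Hypothetical values standing in for y_train and f(X_train, theta).
y_train = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.5, 5.5, 6.0, 9.0]

# Explicit loop, dividing by the number of samples instead of a hard-coded 90.
loop_error = 0.0
for predicted, actual in zip(y_predicted, y_train):
    loop_error += (predicted - actual) ** 2
loop_error /= len(y_train)

# Vectorized version using NumPy array operations.
numpy_error = np.mean(np.square(np.array(y_predicted) - np.array(y_train)))

# Compare with a tolerance rather than ==, since these are floating-point values.
agree = np.isclose(loop_error, numpy_error)
print(loop_error, numpy_error, agree)
```

Using `len(y_train)` rather than a literal sample count also means the same code works unchanged for the validation and test sets.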
Upvotes: 1