Anthony Kubeka

Reputation: 29

How do you calculate the training error and validation error of a linear regression model?

I have a linear regression model and my cost function is a Sum of Squares Error function. I've split my full dataset into three datasets, training, validation, and test. I am not sure how to calculate the training error and validation error (and the difference between the two).

Is the training error the Residual Sum of Squares error calculated using the training dataset?

An example of what I'm asking: So if I was doing this in Python, and let's say I had 90 data-points in the training data set, then is this the correct code for the training error?

y_predicted = f(X_train, theta) #predicted y-value at point x, where y_train is the actual y-value at x
training_error = 0
for i in range(90):
  out = y_predicted[i] - y_train[i] 
  out = out*out 
  training_error+=out

training_error = training_error/2
print('The training error for this regression model is:', training_error)

Upvotes: 2

Views: 16894

Answers (1)

jawsem

Reputation: 771

As mentioned in a comment on the post, you need to divide by the total number of samples to get a number that you can compare across the training, validation, and test sets, which generally have different sizes.

With that one change, the code becomes:

y_predicted = f(X_train, theta) #predicted y-value at point x, where y_train is the actual y-value at x
training_error = 0
for i in range(90):
  out = y_predicted[i] - y_train[i] 
  out = out*out 
  training_error+=out

# divide by the number of samples (90) instead of 2
training_error = training_error/90
print('The training error for this regression model is:', training_error)

The goal is to compare different subsets of the data using the same metric. The division by 2 in your code is fine as well, as long as you also divide by the number of samples: a constant factor of 1/2 scales every subset's error equally and does not change the comparison.
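To make the comparison concrete, here is a minimal sketch that applies the same mean-squared-error metric to a training set and a validation set and prints the difference. The helper name `mse`, the synthetic data, and the use of `np.polyfit` to fit the line are assumptions for illustration; they stand in for the `f(X, theta)` model from the question, which is not shown.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: sum of squared residuals divided by the sample count."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

# Synthetic data (assumed for illustration): y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=120)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=120)

# 90 training points as in the question, 30 held out for validation
x_train, y_train = x[:90], y[:90]
x_val, y_val = x[90:], y[90:]

# Fit a straight line on the training data only
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Same metric on both subsets, so the two numbers are directly comparable
training_error = mse(y_train, slope * x_train + intercept)
validation_error = mse(y_val, slope * x_val + intercept)

print('Training error (MSE):', training_error)
print('Validation error (MSE):', validation_error)
print('Difference:', validation_error - training_error)
```

Because both numbers are per-sample averages, the different subset sizes (90 and 30 here) do not distort the comparison. A validation error that is much larger than the training error is commonly read as a sign of overfitting.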

Another option in Python is the scikit-learn library, which already provides this function:

from sklearn.metrics import mean_squared_error
training_error = mean_squared_error(y_train,y_predicted)

Also, for calculations like this it is generally better and faster to use vectorized array operations instead of a for loop. With only 90 records the difference is negligible, but with larger sample sizes you could use NumPy like this:

import numpy as np

training_error = np.mean(np.square(np.array(y_predicted)-np.array(y_train)))

All three approaches should give the same result, up to floating-point rounding.
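As a quick sanity check, the sketch below runs the loop version and the NumPy one-liner on the same small set of values and confirms they agree. The four numbers are made up for illustration and stand in for `y_train` and `y_predicted`; `mean_squared_error(y_train, y_predicted)` from scikit-learn should return the same value for these inputs.

```python
import numpy as np

# Hypothetical values standing in for y_train and y_predicted
y_train = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5, 0.0, 2.0, 8.0]

# Loop version: sum the squared residuals, then divide by the sample count
n = len(y_train)
loop_error = 0.0
for i in range(n):
    diff = y_predicted[i] - y_train[i]
    loop_error += diff * diff
loop_error /= n

# Vectorized version: the same computation on whole arrays at once
numpy_error = np.mean(np.square(np.array(y_predicted) - np.array(y_train)))

print('Loop MSE: ', loop_error)    # 0.375
print('NumPy MSE:', numpy_error)   # 0.375
assert np.isclose(loop_error, numpy_error)
```

Using `len(y_train)` rather than a hard-coded 90 keeps the loop correct if the size of the dataset changes.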

Upvotes: 1
