Compute yhat for OLS Regression

Question

I have implemented a method to compute the betas for OLS regression in Python. Now I would like to score my model using R^2. For my assignment I am not allowed to use Python packages to do so, so I have to implement the scoring from scratch.

#load the data
import numpy as np
import pandas as pd
from numpy.linalg import inv
from sklearn.datasets import load_boston
boston = load_boston()

# Set the X and y variables. 
X = boston.data
y = boston.target

#append ones to my X matrix. 
int = np.ones(shape=y.shape)[..., None]
X = np.concatenate((int, X), 1)

#compute betas. 
betas = inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)

# extract the feature names of the boston data set and prepend the 
#intercept
names = np.insert(boston.feature_names, 0, 'INT')

# collect results into a DataFrame for pretty printing
results = pd.DataFrame({'coeffs':betas}, index=names)

#print the results
print(results)

            coeffs
INT      36.491103
CRIM     -0.107171
ZN        0.046395
INDUS     0.020860
CHAS      2.688561
NOX     -17.795759
RM        3.804752
AGE       0.000751
DIS      -1.475759
RAD       0.305655
TAX      -0.012329
PTRATIO  -0.953464
B         0.009393
LSTAT    -0.525467

Now, I would like to implement an R^2 function to score my model on this data (or any other data). (see here: https://en.wikipedia.org/wiki/Coefficient_of_determination)

My issue is I am not entirely certain how to compute the numerator, SSE. In code it would look like this:

#numerator
sse = sum((Y - yhat ** 2)

Where Y are the boston house prices, and yhat are the predicted prices of these houses. However, how do I compute the term, yhat?

Alexander · Accepted Answer

yhat is your estimate for a given observation. You can obtain all of your estimates simultaneously using the dot product via X.dot(betas).
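To make the shapes concrete, here is a minimal sketch on made-up toy data (the names X_toy, betas_toy and y_hat_toy and all the numbers are hypothetical, not taken from the Boston data). It only illustrates that one matrix-vector product returns one prediction per row of the design matrix:

```python
import numpy as np

# Hypothetical toy design matrix: 4 observations,
# an intercept column of ones plus a single feature.
X_toy = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])

# Made-up coefficients: intercept 2.0, slope 0.5.
betas_toy = np.array([2.0, 0.5])

# (4, 2) dot (2,) gives a length-4 vector: one estimate per observation.
y_hat_toy = X_toy.dot(betas_toy)

print(y_hat_toy.shape)  # (4,)
print(y_hat_toy)        # values 2.0, 2.5, 3.0, 3.5
```

Each entry is the intercept plus the slope times that row's feature value, which is exactly what X.dot(betas) computes for every row at once.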

Your sum of squared errors would be the following (note the correction to the version you gave: you need to square the difference, i.e. square the errors):

y_hat = X.dot(betas)
errors = y - y_hat 
sse = (errors ** 2).sum()

Your total sum of squares:

tss = ((y - y.mean()) ** 2).sum()

And the resulting R-squared (coefficient of determination):

r2 = 1 - sse / tss
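If you want to reuse this on other data, the three steps above could be wrapped in a small helper. This is only a sketch: the function name r_squared and the demo array y_demo are my own choices for illustration, and it assumes y is not constant (otherwise tss is zero and the ratio is undefined).

```python
import numpy as np

def r_squared(y, y_hat):
    """Return 1 - SSE / TSS for observed values y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sse = ((y - y_hat) ** 2).sum()     # sum of squared errors
    tss = ((y - y.mean()) ** 2).sum()  # total sum of squares (assumed non-zero)
    return 1.0 - sse / tss

# Hypothetical sanity checks on made-up numbers.
y_demo = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions leave no error, so the score is 1.0.
print(r_squared(y_demo, y_demo))

# Always predicting the mean makes SSE equal TSS, so the score is 0.0.
print(r_squared(y_demo, np.full(4, y_demo.mean())))
```

With the variables from the question, the call would be r_squared(y, X.dot(betas)).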

Also, I wouldn't use int as a variable name, to avoid clobbering the built-in int function (just call it ones or const instead).
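For instance, the intercept column could be built like this. It is a sketch on a made-up feature matrix (X_raw and X_design are hypothetical names), using the suggested name ones so that the built-in int stays usable:

```python
import numpy as np

# Hypothetical raw features: 3 observations, 2 columns.
X_raw = np.array([[0.5, 1.0],
                  [1.5, 2.0],
                  [2.5, 3.0]])

# Column of ones for the intercept, named so it does not shadow int.
ones = np.ones((X_raw.shape[0], 1))
X_design = np.concatenate((ones, X_raw), axis=1)

print(X_design.shape)  # (3, 3)
print(int(3.7))        # the built-in still works and prints 3
```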
