user8390325

Linear regression using normal equation

I am attempting to perform linear regression on the height vs. age data from exercise 2 of Andrew Ng's machine learning course: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex2/ex2.html.

The x and y training examples are given in two .dat files, and each sample is described by a single feature. The age in years and height in feet values are separated by newlines, in this format:

age-x
2.0658746
2.3684087
2.5399929
2.5420804

height-y
7.7918926
9.1596757
9.0538354
9.0566138

x0 = 1 is used for the intercept, as per convention. The issue is with finding the parameters theta using the normal equation: theta = inv(X^T * X) * X^T * y.

The output of my program gives parameters [[nan],[0.]] whereas it should be theta0 = 0.7502 and theta1 = 0.0639. I am not sure what I am doing wrong. My code is below.

import numpy as np

X_array = np.fromfile('ex2x.dat', dtype=float)
y_array = np.fromfile('ex2y.dat', dtype=float)

def normal_equation(X, y):
    m = len(X)
    bias_vector = np.ones((m,1))
    X = np.reshape(X, (m, 1))
    X = np.append(bias_vector, X, axis=1)
    y = np.reshape(y, (m, 1))
    X_transpose = X.T

    theta = np.linalg.inv(X_transpose.dot(X))
    theta = theta.dot(X_transpose)
    theta = theta.dot(y)

    return theta

theta = normal_equation(X_array, y_array)
print(theta)

Answers (1)

Mark

Reputation: 92440

You are reading the files incorrectly. If you look at the files, you will see that there are only 50 rows, yet m is 106 in your code. The reason is that np.fromfile() treats a file as binary when no separator is specified, and these are text files.

Try passing a separator when you read each file, for example:

array = np.fromfile('path/to/ex2x.dat', sep=' ', dtype=float)

From numpy docs:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html

sep : str
    Separator between items if file is a text file. Empty ("") separator means the file should be treated as binary. Spaces (" ") in the separator match zero or more whitespace characters. A separator consisting only of spaces must match at least one whitespace.
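
To see the difference between the two modes, here is a minimal sketch. It writes four made-up values to a temporary text file and reads it back both ways. The file name sample.dat and the helper read_both_ways are hypothetical, not part of the exercise. The binary read reinterprets the raw bytes in 8-byte chunks, so its length depends on the file size rather than on the number of rows.

```python
import tempfile
from pathlib import Path

import numpy as np


def read_both_ways(path):
    """Return (binary_read, text_read) of the same file (hypothetical helper)."""
    as_binary = np.fromfile(path, dtype=float)         # default sep="": raw bytes
    as_text = np.fromfile(path, dtype=float, sep=' ')  # whitespace-separated text
    return as_binary, as_text


# Four illustrative values, one per line; each line is exactly 10 bytes long.
text = "2.0658746\n2.3684087\n2.5399929\n2.5420804\n"

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.dat"
    sample.write_bytes(text.encode("ascii"))
    as_binary, as_text = read_both_ways(str(sample))

print(len(as_binary))  # 40 bytes split into 8-byte floats
print(len(as_text))    # one value per line of text
```

The values in as_binary are meaningless, since they are ASCII characters reinterpreted as floats. That kind of garbage input is a plausible source of the nan in your result.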

With sep=' ' passed to both np.fromfile() calls, your code returns:

[[ 0.75016254]
 [ 0.06388117]]
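
As a side note, once the data loads correctly, the same parameters can be found without forming the inverse explicitly. The sketch below is an alternative to your normal_equation, not a correction of it: it calls np.linalg.lstsq on the design matrix. The function name fit_line and the synthetic data are assumptions for illustration only, and the real files would still have to be read as text first.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares fit of y ~ theta0 + theta1 * x; returns [theta0, theta1]."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    X = np.column_stack([np.ones_like(x), x])  # prepend the x0 = 1 column
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta


# Synthetic, noise-free points on a known line (not the exercise data).
x_demo = np.linspace(2.0, 8.0, 50)
y_demo = 0.75 + 0.064 * x_demo

theta = fit_line(x_demo, y_demo)
print(theta)
```

For a well-conditioned problem like this one, both approaches should agree. lstsq is generally the more robust choice when X^T * X is close to singular.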
