Reputation: 169
I am running a linear regression using gradient descent (GDA) on data imported from a .csv file (data_axis and data are the exported dates and stock market prices, respectively). The code below returns [nan nan nan nan nan nan] as the theta value. The squared error also returns nan.
Error messages: 'overflow encountered in multiply', 'invalid value encountered in add'
import numpy as np
import matplotlib.pyplot as plt

# data_axis and data come from the .csv import (dates and prices)
Xdata = np.array(data_axis)
Xdata = Xdata.reshape(-1,2)
print(Xdata.shape)
Ydata = np.array(data[0:783])
Ydata = Ydata[::-1]
print(Ydata.shape)

def phi(x):
    # quadratic feature map for a single 2-component sample
    return np.array([1,x[0],x[1],x[0]*x[0],x[1]*x[0],x[1]*x[1]])

def gda(X, Y):
    n = len(X)
    theta = np.zeros(len(X[0]))
    alpha = 0.5
    iterations = 1000
    for j in range(iterations):
        for i in range(len(X)):
            # per-sample update, scaled by the number of samples
            theta += alpha*(Y[i] - np.dot(theta,X[i]))*X[i]/n
    return theta

def linear_regression_with_features(X, Y, phi):
    phi_X = np.array([ phi(x) for x in X])
    return gda(phi_X,Y)

theta = linear_regression_with_features(Xdata, Ydata, phi)

def h(theta, x):
    # prediction for a single sample x
    return np.dot(theta,phi(x))

error = np.array([Ydata[i]-h(theta,Xdata[i]) for i in range(len(Xdata))])
s_error = np.dot(error,error)
print('theta= ', theta)
print('square error= ', s_error)
plt.plot(Xdata,Ydata,'co')
# h expects one sample at a time, so evaluate it row by row
plt.plot([h(theta,x) for x in Xdata],'r-')
The code does successfully return and plot a linear regression for randomly generated inputs Xdata = np.random.rand(783,2), Ydata = np.array([ 2-4*x[0]+3*x[1]+x[0]*x[0]+2*x[0]*x[1]-3*x[1]*x[1] for x in Xdata]).
I checked that there are no NaN values in the .csv file. I searched for the error message, and read that some Python & NumPy operations involving extremely small or extremely large numbers may output a NaN value as the result. Could this be my issue? Or is there something else which may be fixed?
Upvotes: 0
Views: 1292
Reputation: 169
I found the problem: it was a combination of very small values, very large values, and input matrices being the wrong shape. Here, the X-axis input and Y-axis input had to have shapes (n,2) and (n,), respectively.
Rounding did not help, but rearranging the data from scratch did.
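For anyone hitting the same warnings, here is a minimal sketch of guarding against both causes. It is not the exact fix I applied. The helper names check_shapes and standardize are hypothetical, the sample data is synthetic, and standardizing the inputs is a swapped-in technique that I am assuming keeps the quadratic features small enough for the update in gda to stay finite.

```python
import numpy as np

def phi(x):
    # Same quadratic feature map as in the question.
    return np.array([1, x[0], x[1], x[0]*x[0], x[1]*x[0], x[1]*x[1]])

def check_shapes(X, Y):
    # Hypothetical helper: enforce X of shape (n, 2) and Y of shape (n,).
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim != 2 or X.shape[1] != 2:
        raise ValueError(f"X must have shape (n, 2), got {X.shape}")
    if Y.shape != (X.shape[0],):
        raise ValueError(f"Y must have shape ({X.shape[0]},), got {Y.shape}")
    return X, Y

def standardize(X):
    # Hypothetical helper: rescale each column to zero mean, unit variance.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - mean) / std

def gda(X, Y, alpha=0.5, iterations=100):
    # Same per-sample update rule as in the question.
    n = len(X)
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        for i in range(n):
            theta += alpha * (Y[i] - np.dot(theta, X[i])) * X[i] / n
    return theta

# Synthetic stand-in for the .csv data: large raw magnitudes on purpose.
rng = np.random.default_rng(0)
X_raw = np.column_stack([rng.uniform(1e4, 2e4, 200),
                         rng.uniform(1e5, 2e5, 200)])
Y_raw = rng.uniform(100, 200, 200)

X, Y = check_shapes(X_raw, Y_raw)
phi_X = np.array([phi(x) for x in standardize(X)])
theta = gda(phi_X, Y)
print('theta= ', theta)
```

Note that theta here is fitted on the standardized inputs, so any new point has to be rescaled with the same mean and standard deviation before calling phi on it.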
Upvotes: 1