Reputation: 1175
I have a house-sales dataset to which I am applying linear regression. After getting the slope and y-intercept, I plot the graph and compute the cost, and the result I get looks a little odd to me.
Here's the code for plotting the straight line
import matplotlib.pyplot as plt

def plotLine(slope, yIntercept, X, y):
    # y value of the line for every x in X
    abline_values = [slope * i + yIntercept for i in X]
    plt.scatter(X, y)                    # raw data points
    plt.plot(X, abline_values, 'black')  # fitted line
    plt.title(slope)
    plt.show()
Following is the function for computing cost
import numpy as np

def computeCost(m, parameters, x, y):
    [yIntercept, slope] = parameters
    hypothesis = yIntercept - np.dot(x, slope)  # predicted values from the parameters
    loss = hypothesis - y                       # residuals
    cost = np.sum(loss ** 2) / (2 * m)
    return cost
And the following lines of code give me the x vs y plot with the line from the computed parameters (for the sake of simplicity of this question, I've set the parameters manually) and the cost value.
yIntercept = -70000
slope = 0.85
print("Starting gradient descent at b = %d, m = %f, error = %f" % (yIntercept, slope, computeCost(m, parameters, X, y)))
plotLine(slope, yIntercept, X, y)
And the output of the above snippet is:
So, my questions are:
1. Is this the right way to plot a straight line over the x vs y plot?
2. Why is the cost value so big, and is it possible for the cost to be this big even when the parameters fit the data well?
Edit 1
The m in the print statement is the slope value, not the size of X, i.e., len(X).
Upvotes: 0
Views: 3655
Reputation: 149
The error value is large due to the unnormalized input data. According to your code, x varies from 0 to 250k. In this case, I would suggest that you normalize x to be in [0, 1]. With that, I would expect the loss to be small, and so would the learnt parameters (slope and intercept).
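For example, a minimal min-max scaling sketch (the variable name and the sample values below are made up for illustration, not taken from your data):

import numpy as np

# Hypothetical, unnormalized x values on the order of 0 to 250k
X = np.array([50000.0, 120000.0, 250000.0, 75000.0])

# Min-max normalization: rescales X to lie in [0, 1]
X_norm = (X - X.min()) / (X.max() - X.min())

print(X_norm)  # all values now lie between 0 and 1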
Upvotes: 1
Reputation: 1689
1. Your way of plotting seems right; you can probably simplify
abline_values = [slope * i + yIntercept for i in X]
to
abline_values = slope * X + yIntercept
2. Did you set m=0.85 in your example? It seems so, but I cannot tell since you did not provide the call to the cost function. Shouldn't m be the size of the sample? If you add up all the squared errors and divide them by 2*0.85, the size of the error depends on your sample size. And since it is not a relative error and the values are rather large, it is possible that all these errors add up to that huge number. Try setting m to the size of your sample.
In addition, there is an error in the sign in the computation of the hypothesis value; it should be a +. Otherwise you would effectively have a negative slope, which explains the large errors as well.
import numpy as np

def computeCost(parameters, x, y):
    [yIntercept, slope] = parameters
    hypothesis = yIntercept + np.dot(x, slope)  # note the + sign here
    loss = hypothesis - y
    cost = np.sum(loss ** 2) / (2 * len(x))     # m taken as the sample size
    return cost
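As a quick sanity check, you could call it like this (the numbers are made up and already normalized, purely for illustration):

# Made-up, already-normalized data; not from the question
x = np.array([0.1, 0.4, 0.7, 1.0])
y = np.array([0.15, 0.45, 0.72, 0.98])

parameters = [0.05, 0.9]  # hypothetical [yIntercept, slope]
print(computeCost(parameters, x, y))  # small value when the line fits well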
Upvotes: 2