Mukul Jain

Reputation: 1175

Linear Regression - mean square error coming too large

I have a house-sales dataset and I am applying linear regression to it. After getting the slope and y-intercept, I plot the line and compute the cost, and the result seems a little odd to me, because

  1. Line from parameters is fitting the data well
  2. But the cost value from the same parameter is huge

Here's the code for plotting the straight line

import matplotlib.pyplot as plt

def plotLine(slope, yIntercept, X, y):
  # y-values of the fitted line for every x in the data
  abline_values = [slope * i + yIntercept for i in X]
  plt.scatter(X, y)
  plt.plot(X, abline_values, 'black')
  plt.title(slope)
  plt.show()

The following is the function for computing the cost:

import numpy as np

def computeCost(m, parameters, x, y):
  [yIntercept, slope] = parameters
  hypothesis = yIntercept - np.dot(x, slope)
  loss = hypothesis - y
  cost = np.sum(loss ** 2) / (2 * m)
  return cost

The following lines of code give me the x vs y plot with the line from the computed parameters (for the sake of simplicity, I've set the parameters manually), along with the cost value.

yIntercept = -70000
slope = 0.85
print("Starting gradient descent at b = %d, m = %f, error = %f" % (yIntercept, slope, computeCost(m, parameters, X, y)))
plotLine(slope, yIntercept, X, y)

The output of the above snippet is:

[plot of X vs y with the fitted line from the computed parameters]

So, my questions are:

1. Is this the right way to plot a straight line over the x vs y scatter plot?

2. Why is the cost value so big, and is it possible for the cost to be this big even when the parameters fit the data well?

Edit 1

The m in the print statement is the slope value, not the size of X, i.e. len(X).

Upvotes: 0

Views: 3655

Answers (2)

Cuong

Reputation: 149

The error value is large because the input data is not normalized. According to your code, x varies from 0 to 250k. In this case, I would suggest that you normalize x to lie in [0, 1]. With that, I would expect the loss to be small, and so would the learnt parameters (slope and intercept).
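
As a minimal sketch of what that normalization could look like (assuming X is a 1-D NumPy array of house sizes; the sample values below are made up):

import numpy as np

# Made-up feature vector standing in for the X from the question
X = np.array([650.0, 1200.0, 80000.0, 150000.0, 250000.0])

# Min-max scaling: squeeze X into [0, 1] so the squared errors stay on a small scale
X_scaled = (X - X.min()) / (X.max() - X.min())
print(X_scaled)  # all values now lie between 0.0 and 1.0

Keep in mind that a slope learnt on X_scaled refers to the scaled feature, so it has to be rescaled before being applied to the original data.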

Upvotes: 1

CodeZero

Reputation: 1689

1. Your way of plotting seems right; you can probably simplify

abline_values = [slope * i + yIntercept for i in X]

to

abline_values = slope * X + yIntercept

2. Did you set m = 0.85 in your example? It seems so, but I cannot tell for sure, since the snippet does not show how m is defined before the call to the cost function. Shouldn't it be the size of the sample? If you add up all the squared errors and divide them by 2 * 0.85, the size of the error depends on your sample size; and since it is not a relative error and the values are rather large, it is plausible that all these errors add up to that huge number. Try setting m to the size of your sample.

In addition, there is a sign error in the computation of the hypothesis value: it should be a +. Otherwise you would be fitting a negative slope, which explains large errors as well.

import numpy as np

def computeCost(parameters, x, y):
    [yIntercept, slope] = parameters
    # Hypothesis is intercept plus slope times x (note the + sign)
    hypothesis = yIntercept + np.dot(x, slope)
    loss = hypothesis - y
    cost = np.sum(loss ** 2) / (2 * len(x))
    return cost
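
A quick usage sketch, assuming X and y are the arrays from your question:

parameters = [-70000, 0.85]           # [yIntercept, slope], as in your example
print(computeCost(parameters, X, y))  # cost is now averaged over len(X) samples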

Upvotes: 2
