IMCoins

Reputation: 3306

Plotting regression line

I'm having trouble plotting a regression line. I probably don't fully understand the mathematics behind these functions, so I'm asking here to be sure.

from matplotlib import pyplot as plt
import numpy as np

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x - n*m_y*m_x)
    SS_xx = np.sum(x*x - n*m_x*m_x)

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x

    return (b_0, b_1)

def plot_regression_line(xs, ys):
    # dev stands for deviation
    dev = estimate_coef(xs, ys)

    y_pred = []
    for x in xs:
        y_pred.append(dev[0] + dev[1] * x)

    # plotting the regression line
    plt.plot(xs, y_pred, color = "g")

def main():
    # Defining points.
    xarr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    yarr = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12]

    # Setting points as numpy arrays.
    # It is more convenient this way for further process.
    x = np.array(xarr)
    y = np.array(yarr)

    # Plotting points.
    plt.scatter(x, y)

    plot_regression_line(x, y)
    plt.show()

if __name__ == "__main__":
    main()

The code above produces a nice plot:

Good plot

But if I reverse the y values, just to test my function, by doing the following in main():

yarr = [1, 3, 2, 5, 7, 8, 8, 9, 10, 12]
yarr.reverse()

I get the following...

Wrong plot


I want my plot_regression_line function to plot the line I expect for the data I pass in, and I can't understand why it doesn't.

I believe the problem comes from the estimate_coef function, especially how b_0 is computed, but I don't know what to change to make the function work as intended.

Upvotes: 1

Views: 1417

Answers (1)

Mr. T

Reputation: 12410

I don't know where you got your regression formula from; Wikipedia has a different one. Transcribed into your script's conventions, it should be

SS_xy = np.sum((x - m_x) * (y - m_y))
SS_xx = np.sum(x*x - m_x*m_x)

which gives you the right regression line in both cases. You also no longer need to calculate n, because it is already accounted for when you compute the mean values.
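To see why the original version goes wrong: in np.sum(y*x - n*m_y*m_x) the term n*m_y*m_x is subtracted from every element before summing, so in total it is removed n times instead of once. Below is a minimal sketch of estimate_coef with the two corrected lines, run on the data from the question in both orders. The comparison against np.polyfit is my own addition as a sanity check, not part of the original answer.

```python
import numpy as np

def estimate_coef(x, y):
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)

    # cross-deviation and deviation about x (corrected formulas)
    SS_xy = np.sum((x - m_x) * (y - m_y))
    SS_xx = np.sum(x * x - m_x * m_x)

    # regression coefficients: slope b_1, intercept b_0
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return b_0, b_1

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# Try the original order and the reversed order from the question.
for label, ys in (("original", y), ("reversed", y[::-1])):
    b_0, b_1 = estimate_coef(x, ys)
    # np.polyfit with degree 1 returns [slope, intercept]
    slope, intercept = np.polyfit(x, ys, 1)
    agrees = np.allclose([b_1, b_0], [slope, intercept])
    print(f"{label}: b_0={b_0:.4f}, b_1={b_1:.4f}, matches polyfit: {agrees}")
```

With the reversed data the slope should simply change sign, since the x values are symmetric about their mean.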

Upvotes: 5
