Reputation: 167
Hi everyone, I was practicing linear regression on a dataset from Kaggle (https://www.kaggle.com/sohier/calcofi, bottle.csv), and I tried to implement it in the following way:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("bottle.csv")
df
df1 = df.loc[:,"T_degC":"Salnty"]
df1 = df1.dropna()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = df1["T_degC"].values  # temperature as a numpy array
y = df1["Salnty"].values  # salinity as a numpy array
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.4)
lm = LinearRegression()
# scikit-learn expects a 2-D array of features
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)
y_train = y_train.reshape(-1,1)
lm.fit(X_train, y_train)
The problem occurs when I look at the intercept and the coefficient, which are:
lm.intercept_
lm.coef_
which turn out to be about 34.4 and -0.05, respectively. But then consider the scatter plot of the X and y variables:
plt.scatter(X_train, y_train)
It definitely does not look like a negatively sloped line could be the regression line for this distribution, so I wonder what I may have done wrong to get this result.
Upvotes: 2
Views: 1192
Reputation: 3353
This is a very interesting case study!
It appears the regression line is in fact right and your eyes (and your plot) are deceiving you.
The scatter plot you're producing looks like this:
Sure looks like a positive slope, right? Right?
Well, no. There are so many points here that it's impossible to see where most of them lie. It may well be the case that most points show a downward slope, but they are all plotted on top of each other, while a 'few' other points that are not on top of each other show an upward slope.
To test that, I plotted the points with a much lower opacity and a smaller marker size (so the amount of overlap is reduced):
plt.scatter(X_train, y_train, alpha=0.002, s=1)
plt.show()
Here you can see that most points do in fact show a downward slope (although one might also argue that a straight line is not the best way to model this relationship). Remember that linear regression tries to fit the best straight line: it follows the bulk of the points, so a relatively small number of points going against the main trend will not flip the slope, and the model cannot capture a more complicated pattern that is not straight.
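If you want to double-check that slope independently of scikit-learn, here is a minimal sketch (assuming the X_train and y_train arrays from your code) using np.polyfit:
# Fit a degree-1 polynomial (a straight line); np.polyfit returns the
# coefficients highest power first, i.e. [slope, intercept].
slope, intercept = np.polyfit(X_train.ravel(), y_train.ravel(), deg=1)
print(slope, intercept)  # slope should come out negative, matching lm.coef_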
In fact, the linear correlation coefficient is also negative:
df1[["T_degC", "Salnty"]].corr()
# T_degC Salnty
#T_degC 1.000000 -0.505266
#Salnty -0.505266 1.000000
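That is no coincidence: for a simple one-feature regression, the fitted slope equals the correlation coefficient times the ratio of the standard deviations, slope = r * std(y) / std(x), so a negative r forces a negative slope. A quick check (on the full df1, so the value will differ slightly from a fit on your train split):
r = df1["T_degC"].corr(df1["Salnty"])
# OLS slope for one feature: correlation scaled by the ratio of std devs.
print(r * df1["Salnty"].std() / df1["T_degC"].std())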
So in short:
1. Your regression line appears to be correct
2. Make sure you're looking at the right plot - if all points are on top of each other, a scatter plot may not be optimal.
One more plot: the scatter plot with your regression on top of it:
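A minimal sketch to reproduce it (assuming the lm, X_train and y_train from the question):
plt.scatter(X_train, y_train, alpha=0.002, s=1)
# Evaluate the fitted line over the range of the training data.
xs = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
plt.plot(xs, lm.predict(xs), color='red')
plt.show()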
That seems reasonable (for a straight line), doesn't it?
Perhaps another kind of plot is easier to read with this many points anyway:
import seaborn as sns
sns.jointplot(x='T_degC', y='Salnty', data=df1, kind='hex')
The jointplot makes the overlap explicit by coloring the regions of the plot that contain many points more strongly. This again confirms that there is a downward trend, along with a (relatively small) number of other points that go against it. Hope that helps!
Upvotes: 8