Using scikit-learn LinearRegression to plot a linear fit

I am trying to build a linear regression model that predicts a son's height from his father's height.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import LinearRegression


Headings_cols = ['Father', 'Son']
df = pd.read_csv('http://www.math.uah.edu/stat/data/Pearson.txt', 
                 delim_whitespace=True, names=Headings_cols)



X = df['Father']  
y = df['Son']  

model2 = LinearRegression()
model2.fit(y, X)

plt.scatter(X, y,color='g')
plt.plot(X, model.predict(X),color='g')

plt.scatter(y, X, color='r')
plt.plot(y, X, color='r')

I get this error:

ValueError: could not convert string to float: 'Father'

My second question: how do I calculate the average height of the sons and the standard error of the mean?

Upvotes: 12

Answers (3)

D-S

Reputation: 21

I was looking for the answer to the same question, but the initial dataset URL is no longer valid. The "Father/Son" Pearson height dataset csv can be retrieved from the following URL and then just needs a couple of minor tweaks to work as advertised (note the renaming of the .csv file):

http://www.randomservices.org/random/data/Pearson.html

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import csv

from sklearn.linear_model import LinearRegression

# data retrieved from http://www.randomservices.org/random/data/Pearson.html#

df = pd.read_csv('./pearsons_height_data.csv',
                 quotechar='"',
                 quoting=csv.QUOTE_ALL)

df.head() # preview the first rows; read_csv already took the header from the first line

# LinearRegression will expect an array of shape (n, 1)
# for the "Training data"
X = df['Father'].values[:,np.newaxis]
# target data is array of shape (n,)
y = df['Son'].values

model2 = LinearRegression()
model2.fit(X, y)

plt.scatter(X, y,color='g')
plt.plot(X, model2.predict(X),color='k')

plt.show()
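
Once the model is fitted, the line it draws can also be read off as numbers. The following is a minimal sketch, assuming synthetic noise-free data on an arbitrarily chosen line (slope 0.5, intercept 35) rather than the Pearson file, so the recovered values are easy to check. The names father, son, slope and intercept are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free stand-in data on a known line (not the Pearson data):
# son = 0.5 * father + 35
father = np.array([60.0, 64.0, 68.0, 72.0])
son = 0.5 * father + 35.0

# fit expects X with shape (n, 1) and y with shape (n,)
model = LinearRegression().fit(father[:, np.newaxis], son)

slope = model.coef_[0]        # one coefficient per feature column
intercept = model.intercept_  # constant term of the fitted line
print(slope, intercept)
```

With the real data, model2.coef_ and model2.intercept_ from the script above would hold the corresponding values.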

Upvotes: 2

ImportanceOfBeingErnest

Reputation: 339052

There are two main issues here:

1. Getting the data out of the source
2. Getting the data into the shape that sklearn.linear_model.LinearRegression.fit understands

1. Getting the data out
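The source file contains a header line with the column names. We do not want the column names in our data. By default, pd.read_csv uses the first line of the file as the header, so simply leaving out the names argument is enough. Passing names=Headings_cols as in the question makes pandas treat the header line as a data row, which is what produces the "could not convert string to float: 'Father'" error. Calling df.head() afterwards only previews the first rows. The columns can then be queried by name as usual, i.e. df['Father'].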
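
The difference between the two ways of reading the file can be reproduced without the original URL. This is a small sketch, assuming a hypothetical two-row sample in the same whitespace-separated layout (the values in raw are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical sample: header line followed by two data rows
raw = "Father Son\n65.0 59.8\n63.3 63.2\n"

# names= without header=0: the header line is read as an ordinary data row
with_names = pd.read_csv(io.StringIO(raw), sep=r"\s+", names=["Father", "Son"])

# Default behaviour: the first line becomes the column header
default = pd.read_csv(io.StringIO(raw), sep=r"\s+")

print(with_names.shape, repr(with_names["Father"].iloc[0]))
print(default.shape, default["Father"].dtype)
```

In the first frame the string 'Father' sits in the data, so the column cannot be converted to floats. In the second frame the column is numeric.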

2. Getting the data into shape
sklearn.linear_model.LinearRegression.fit takes two arguments: first the "training data", which should be a 2D array of shape (n_samples, n_features), and second the "target values". In the case considered here we simply want to make a fit, so we do not care about these notions too much, but we need to bring the first input into the required shape. This is easily done by adding a new axis to the array, i.e. df['Father'].values[:,np.newaxis]
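
To see what adding an axis does to the shape, here is a short sketch using a hypothetical array of four heights in place of df['Father'].values:

```python
import numpy as np

# Hypothetical values standing in for df['Father'].values
father = np.array([65.0, 63.3, 65.0, 65.8])

X_newaxis = father[:, np.newaxis]  # add a second axis of length 1
X_reshape = father.reshape(-1, 1)  # equivalent: let NumPy infer the row count

print(father.shape)     # 1D array
print(X_newaxis.shape)  # 2D column
print(X_reshape.shape)  # 2D column
```

Both forms give the same single-column 2D array, so either one can be passed as the first argument to fit.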

The complete working script:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression

df = pd.read_csv('http://www.math.uah.edu/stat/data/Pearson.txt',
                 delim_whitespace=True)
df.head() # preview the first rows; read_csv already took the header from the first line


# LinearRegression will expect an array of shape (n, 1) 
# for the "Training data"
X = df['Father'].values[:,np.newaxis]
# target data is array of shape (n,) 
y = df['Son'].values


model2 = LinearRegression()
model2.fit(X, y)

plt.scatter(X, y,color='g')
plt.plot(X, model2.predict(X),color='k')

plt.show()
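
The question also asks for the average height of the sons and the standard error of the mean. A minimal sketch of that calculation follows, assuming synthetic normally distributed values as a stand-in because the original URL may not be reachable. With the real data, son would be df['Son'].values.

```python
import numpy as np

# Synthetic stand-in for df['Son'].values (not the real Pearson data)
rng = np.random.default_rng(0)
son = rng.normal(loc=68.0, scale=2.5, size=1000)

mean_son = son.mean()
# Standard error of the mean: sample standard deviation (ddof=1) over sqrt(n)
sem_son = son.std(ddof=1) / np.sqrt(son.size)

print(mean_son, sem_son)
```

On a DataFrame, df['Son'].mean() and df['Son'].sem() should give the same two quantities directly, since pandas uses ddof=1 for sem by default.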

Upvotes: 34

Alex

Reputation: 12913

When loading the data, do this instead:

df = pd.read_csv('http://www.math.uah.edu/stat/data/Pearson.txt', 
                 delim_whitespace=True)
df.columns = Headings_cols

You should also make sure X is shaped correctly:

X = df['Father'].values.reshape(-1, 1)

Upvotes: -1
