I am trying to build a linear regression model that predicts a son's height from his father's height.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import LinearRegression
Headings_cols = ['Father', 'Son']
df = pd.read_csv('http://www.math.uah.edu/stat/data/Pearson.txt',
delim_whitespace=True, names=Headings_cols)
X = df['Father']
y = df['Son']
model2 = LinearRegression()
model2.fit(y, X)
plt.scatter(X, y,color='g')
plt.plot(X, model.predict(X),color='g')
plt.scatter(y, X, color='r')
plt.plot(y, X, color='r')
I get this error:
ValueError: could not convert string to float: 'Father'
The second thing is: how do I calculate the average height of the sons and the standard error of the mean?
Upvotes: 12
Views: 44721
Reputation: 21
I was looking for the answer to the same question, but the initial dataset URL is no longer valid. The "Father/Son" Pearson height dataset csv can be retrieved from the following URL and then just needs a couple of minor tweaks to work as advertised (note the renaming of the .csv file):
http://www.randomservices.org/random/data/Pearson.html
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import csv
from sklearn.linear_model import LinearRegression
# data retrieved from http://www.randomservices.org/random/data/Pearson.html#
df = pd.read_csv('./pearsons_height_data.csv',
quotechar='"',
quoting=csv.QUOTE_ALL)
df.head() # preview the first rows; read_csv already used the first line as the header
# LinearRegression will expect an array of shape (n, 1)
# for the "Training data"
X = df['Father'].values[:,np.newaxis]
# target data is array of shape (n,)
y = df['Son'].values
model2 = LinearRegression()
model2.fit(X, y)
plt.scatter(X, y,color='g')
plt.plot(X, model2.predict(X),color='k')
plt.show()
Upvotes: 2
Reputation: 339052
There are two main issues here:
1. Getting the data out of the source
2. Getting the data into the shape that sklearn.linear_model.LinearRegression.fit understands
1. Getting the data out
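The source file contains a header line with the column names. We do not want the column names in our data. Passing names=Headings_cols tells pd.read_csv that the file has no header, so the header line is read as a data row and the string 'Father' later fails to convert to float. Leaving out names lets pd.read_csv use the first line as the header, which is its default; df.head() then simply previews the first rows. This allows us to query the dataframe by the column names as usual, i.e. df['Father'].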
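To see why the header line ends up in the data, here is a minimal sketch that uses a small in-memory stand-in for the file. The three data rows are made up for illustration and are not taken from the Pearson data.

```python
import io
import pandas as pd

# Made-up stand-in for the whitespace-delimited file, including its header line.
raw = "Father Son\n65.0 59.8\n63.3 63.2\n65.0 63.3\n"

# Passing names= tells pandas the file has no header, so line 1 becomes a data row.
with_names = pd.read_csv(io.StringIO(raw), sep=r"\s+", names=["Father", "Son"])
print(with_names.iloc[0].tolist())    # the first row holds the header strings
print(with_names["Father"].dtype)     # a non-numeric dtype

# Without names=, the first line is used as the header and the columns are numeric.
without_names = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(without_names["Father"].dtype)  # a float dtype
```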
2. Getting the data into shape
sklearn.linear_model.LinearRegression.fit takes two arguments: first the "training data", which should be a 2D array of shape (n_samples, n_features), and second the "target values". In the case considered here we simply want to make a fit, so we do not care about those notions too much, but we need to bring the first input to that function into the desired shape. This can easily be done by adding a new axis to one of the arrays, i.e. df['Father'].values[:,np.newaxis]
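A minimal sketch of the shape requirement, using made-up numbers rather than the Pearson data: a 1D array is rejected by fit, while the same values with an extra axis are accepted. The variable names here are chosen only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values for illustration; y is exactly 2 * x.
x = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,)
y = 2.0 * x                          # shape (4,)

try:
    LinearRegression().fit(x, y)     # 1D training data
    rejected = False
except ValueError:
    rejected = True                  # scikit-learn asks for a 2D array

X = x[:, np.newaxis]                 # shape (4, 1): one row per sample, one feature
model = LinearRegression().fit(X, y)
print(rejected, X.shape, model.coef_, model.intercept_)
```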
The complete working script:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
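df = pd.read_csv('http://www.math.uah.edu/stat/data/Pearson.txt',
                 sep=r'\s+')  # whitespace-delimited; replaces the deprecated delim_whitespace=True
df.head() # preview the first rows; the first line of the file is already used as the header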
# LinearRegression will expect an array of shape (n, 1)
# for the "Training data"
X = df['Father'].values[:,np.newaxis]
# target data is array of shape (n,)
y = df['Son'].values
model2 = LinearRegression()
model2.fit(X, y)
plt.scatter(X, y,color='g')
plt.plot(X, model2.predict(X),color='k')
plt.show()
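The second part of the question (the average height of the sons and the standard error of the mean) is not covered by the script above. Here is a minimal sketch, assuming the usual definition SEM = s / sqrt(n) with the sample standard deviation s (ddof=1). The helper name mean_and_sem and the sample values are made up for illustration.

```python
import numpy as np

def mean_and_sem(values):
    """Return the mean and the standard error of the mean of a 1D sample."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Sample standard deviation (ddof=1) divided by sqrt(n).
    sem = values.std(ddof=1) / np.sqrt(values.size)
    return mean, sem

# Made-up values; with the real data one would pass df['Son'].values instead.
mean, sem = mean_and_sem([1.0, 2.0, 3.0, 4.0, 5.0])
print(mean, sem)
```

pandas offers the same quantities directly as df['Son'].mean() and df['Son'].sem(), where sem() also uses ddof=1 by default.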
Upvotes: 34
Reputation: 12913
When loading the data, do this instead:
df = pd.read_csv('http://www.math.uah.edu/stat/data/Pearson.txt',
delim_whitespace=True)
df.columns = Headings_cols
You should also make sure X is shaped correctly:
X = df['Father'].values.reshape(-1, 1)
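As a quick check that reshape(-1, 1) gives the same result as indexing with np.newaxis, here is a small sketch with a made-up column (df_demo is a hypothetical stand-in for the real dataframe):

```python
import numpy as np
import pandas as pd

# Made-up column for illustration.
df_demo = pd.DataFrame({"Father": [65.0, 63.3, 65.0]})

a = df_demo["Father"].values.reshape(-1, 1)   # -1 lets NumPy infer the row count
b = df_demo["Father"].values[:, np.newaxis]   # adds a trailing axis of length 1
print(a.shape, b.shape, np.array_equal(a, b))
```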
Upvotes: -1