Reputation: 2606
I'm attempting to use multiple linear regression on a data set using scikit-learn, but I am having trouble with getting the correct coefficients. I'm using the Lake Huron data which can be found here:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
and after transforming it, I have the following set of values:
          x1        x2         y
0   0.202165  1.706366  0.840567
1   1.706366  0.840567  0.694768
2   0.840567  0.694768 -0.291031
3   0.694768 -0.291031  0.333170
4  -0.291031  0.333170  0.387371
5   0.333170  0.387371  0.811572
6   0.387371  0.811572  1.415773
7   0.811572  1.415773  1.359974
8   1.415773  1.359974  1.504176
9   1.359974  1.504176  1.768377
..       ...       ...       ...
using
df = pd.DataFrame(nvalues, columns=("x1", "x2", "y"))
result = sm.ols(formula="y ~ x2 + x1", data=df).fit()
print(result.params)
yields
Intercept   -0.007852
x2           1.002137
x1          -0.283798
which are the correct values, but if I end up using scikit-learn I get:
a = np.array([nvalues["x1"], nvalues["x2"]])
b = np.array(nvalues["y"])
a = a.reshape(len(nvalues["x1"]), 2)
b = b.reshape(len(nvalues["y"]), 1)
clf = linear_model.LinearRegression()
clf.fit(a, b)
print(clf.coef_)
I get [[-0.18260922 0.08101687]].
For completeness, here is my code:
from sklearn import linear_model
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
def Main():
    location = r"~/Documents/Time Series/LakeHuron.csv"
    ts = pd.read_csv(location, sep=",", parse_dates=[0], header=0)
    #### initializes the data ####
    ts.drop("Unnamed: 0", axis=1, inplace=True)
    x = ts["time"].values
    y = ts["LakeHuron"].values
    x = x.reshape(len(ts), 1)
    y = y.reshape(len(ts), 1)
    regr = linear_model.LinearRegression()
    regr.fit(x, y)
    diff = []
    for i in range(0, len(ts)):
        diff.append(float(ts["LakeHuron"][i] - regr.predict(x)[i]))
    ts[3] = diff
    nvalues = {"x1": [], "x2": [], "y": []}
    for i in range(0, len(ts)-2):
        nvalues["x1"].append(float(ts[3][i]))
        nvalues["x2"].append(float(ts[3][i+1]))
        nvalues["y"].append(float(ts[3][i+2]))
    df = pd.DataFrame(nvalues, columns=("x1", "x2", "y"))
    result = sm.ols(formula="y ~ x2 + x1", data=df).fit()
    print(result.params)
    #### using scikit-learn ####
    a = np.array([nvalues["x1"], nvalues["x2"]])
    b = np.array(nvalues["y"])
    a = a.reshape(len(nvalues["x1"]), 2)
    b = b.reshape(len(nvalues["y"]), 1)
    clf = linear_model.LinearRegression()
    clf.fit(a, b)
    print(clf.coef_)
if __name__ == "__main__":
Main()
Upvotes: 2
Views: 4550
Reputation: 2606
As per @Orange's suggestion, I've changed the code to something I think is more efficient:
#### using scikit-learn ####
a = []
for i in range(0, len(nvalues["x1"])):
    a.append([nvalues["x1"][i], nvalues["x2"][i]])
a = np.array(a)
b = np.array(nvalues["y"])
a = a.reshape(len(a), 2)
b = b.reshape(len(nvalues["y"]), 1)
clf = linear_model.LinearRegression()
clf.fit(a, b)
print(clf.coef_)
which is similar to the example on the scikit-learn site for simple regression.
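For what it's worth, the explicit loop can likely be collapsed with np.column_stack, which pairs the i-th entries of its inputs directly (a small sketch with made-up lists standing in for the real nvalues dict, not the actual Lake Huron residuals):

```python
import numpy as np

# Toy stand-in for the nvalues dict built earlier
nvalues = {"x1": [0.1, 0.2, 0.3], "x2": [0.4, 0.5, 0.6]}

# column_stack treats each 1-D input as a column, so row i of the
# result is (x1[i], x2[i]) -- no loop and no reshape needed
a = np.column_stack([nvalues["x1"], nvalues["x2"]])

print(a.shape)  # (3, 2)
print(a[0])     # [0.1 0.4]
```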
Upvotes: 0
Reputation: 2158
The problem is the line
a = np.array([nvalues["x1"], nvalues["x2"]])
as it does not arrange the data the way you intend: it stacks the two lists as rows, giving an array of shape (2, n). The subsequent reshape(n, 2) then reads that buffer in row-major order, so instead of (x1, x2) pairs it will generate a dataset
x1_new x2_new
-----------------
x1[0] x1[1]
x1[2] x1[3]
[...]
x1[94] x1[95]
x2[0] x2[1]
[...]
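The misordering is easy to demonstrate on a tiny example (the lists here are made up, not the question's data): reshape never reorders elements, it just re-chops the row-major buffer, so the first rows end up holding consecutive x1 values rather than (x1, x2) pairs.

```python
import numpy as np

x1 = [10, 11, 12, 13]
x2 = [20, 21, 22, 23]

a = np.array([x1, x2])       # shape (2, 4): rows are x1 and x2
bad = a.reshape(len(x1), 2)  # shape (4, 2), but wrongly paired

print(bad)
# [[10 11]
#  [12 13]
#  [20 21]
#  [22 23]]
```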
Try instead
ax1 = np.array(nvalues["x1"])
ax2 = np.array(nvalues["x2"])
ax1 = ax1.reshape(len(nvalues["x1"]), 1)
ax2 = ax2.reshape(len(nvalues["x2"]), 1)
a = np.hstack([ax1,ax2])
There is probably a cleaner way to do that, but this way it works, and the regression now gives the correct results.
EDIT:
The cleaner way is to use transpose() in place of the reshape:
a = a.transpose()
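To see why transpose() works where reshape did not (again with made-up toy lists): np.array([x1, x2]) has shape (2, n), and transposing swaps the axes so that row i holds (x1[i], x2[i]), exactly the sample layout LinearRegression.fit expects.

```python
import numpy as np

x1 = [10, 11, 12, 13]
x2 = [20, 21, 22, 23]

a = np.array([x1, x2])  # shape (2, 4): one row per variable
a = a.transpose()       # shape (4, 2): row i is (x1[i], x2[i])

print(a)
# [[10 20]
#  [11 21]
#  [12 22]
#  [13 23]]
```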
Upvotes: 2