Lukasz
Lukasz

Reputation: 2606

Multiple linear regression scikit-learn and statsmodel

I'm attempting to use multiple linear regression on a data set using scikit-learn, but I am having trouble with getting the correct coefficients. I'm using the Lake Huron data which can be found here:

https://vincentarelbundock.github.io/Rdatasets/datasets.html

and after transforming it, I have the following set of values:

         x1        x2        y
0  0.202165  1.706366  0.840567
1  1.706366  0.840567  0.694768
2  0.840567  0.694768 -0.291031
3  0.694768 -0.291031  0.333170
4 -0.291031  0.333170  0.387371
5  0.333170  0.387371  0.811572
6  0.387371  0.811572  1.415773
7  0.811572  1.415773  1.359974
8  1.415773  1.359974  1.504176
9  1.359974  1.504176  1.768377
...  ...       ...       ...

using

df = pd.DataFrame(nvalues, columns=("x1", "x2", "y"))
result = sm.ols(formula="y ~ x2 + x1", data=df).fit()

print(result.params)

yields

Intercept   -0.007852
y2           1.002137
y1          -0.283798

which are the correct values, but if I end up using scikit-learn I get:

a = np.array([nvalues["x1"], nvalues["x2"]])
b = np.array(nvalues["y"])

a = a.reshape(len(nvalues["x1"]), 2)
b = b.reshape(len(nvalues["y"]), 1)

clf = linear_model.LinearRegression()
clf.fit(a, b)

print(clf.coef_)

I get [[-0.18260922 0.08101687]].

For completeness my code

from sklearn import linear_model

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

def Main():
    location = r"~/Documents/Time Series/LakeHuron.csv"
    ts = pd.read_csv(location, sep=",", parse_dates=[0], header=0)

    #### initializes the data ####
    ts.drop("Unnamed: 0", axis=1, inplace=True)

    x = ts["time"].values
    y = ts["LakeHuron"].values

    x = x.reshape(len(ts), 1)
    y = y.reshape(len(ts), 1)

    regr = linear_model.LinearRegression()
    regr.fit(x, y)

    diff = []
    for i in range(0, len(ts)):
        diff.append(float(ts["LakeHuron"][i]-regr.predict(x)[i]))

    ts[3] = diff

    nvalues = {"x1": [], "x2": [], "y": []}

    for i in range(0, len(ts)-2):
        nvalues["x1"].append(float(ts[3][i]))
        nvalues["x2"].append(float(ts[3][i+1]))
        nvalues["y"].append(float(ts[3][i+2]))

    df = pd.DataFrame(nvalues, columns=("x1", "x2", "y"))
    result = sm.ols(formula="y ~ x2 + x1", data=df).fit()

    print(result.params)

    #### using scikit-learn ####
    a = np.array([nvalues["x1"], nvalues["x2"]])
    b = np.array(nvalues["y"])

    a = a.reshape(len(nvalues["x1"]), 2)
    b = b.reshape(len(nvalues["y"]), 1)

    clf = linear_model.LinearRegression()
    clf.fit(a, b)

    print(clf.coef_)

if __name__ == "__main__":
    Main()

Upvotes: 2

Views: 4550

Answers (2)

Lukasz
Lukasz

Reputation: 2606

as per @Orange suggestion, I've changed the code to something I thing is more efficient:

#### using scikit-learn ####
a = []
for i in range(0, len(nvalues["x1"])):
    a.append([nvalues["x1"][i], nvalues["x2"][i]])

a = np.array(a)
b = np.array(nvalues["y"])

a = a.reshape(len(a), 2)
b = b.reshape(len(nvalues["y"]), 1)

clf = linear_model.LinearRegression()
clf.fit(a, b)

print(clf.coef_) 

Which is a similar to the example on the scikit-learn site for simple regression

Upvotes: 0

0range
0range

Reputation: 2158

The problem is the line

a = np.array([nvalues["x1"], nvalues["x2"]])

as it does not sort the data the way you intend. Instead it will generate a dataset

x1_new    x2_new
-----------------
 x1[0]     x1[1]
 x1[2]     x1[3]
[...]
 x1[94]    x1[95]
 x2[0]     x2[1]
[...]

Try instead

ax1 = np.array(nvalues["x1"])
ax2 = np.array(nvalues["x2"])
ax1 = ax1.reshape(len(nvalues["x1"]), 1)
ax2 = ax2.reshape(len(nvalues["x2"]), 1)
a = np.hstack([ax1,ax2])

There is probably a cleaner way to do that, but this way it works. The regressions now also give all the correct result.

EDIT: The cleaner way is to use transpose():

a = a.transpose()

Upvotes: 2

Related Questions