max
max

Reputation: 52235

pure python code for multivariate linear regression

Due to a bug (perhaps in the numpy distribution I'm using), I can't use numpy.linalg.lstsq. And every statistics library I found didn't install under python 3 (on Windows).

Does someone have pure python 3 code that would perform a multiple linear regression (I just need the betas)?

If not pure python, I could still try it, if maybe the code happens to not use the same C function that crashes numpy.linalg.lstsq on my machine.

Thanks!

Upvotes: 2

Views: 3025

Answers (1)

yosukesabai
yosukesabai

Reputation: 6244

here is the version using this matlib.py by Ernesto P. Adorio. From him you need

With these following code find coeff of linear regression

from matlib import transpose, mattmat, vec2colmat, mat2vec, matdim, matprint 
from qr import  qr

def readdat():
    f = open('dat','r')
    x, y = [], []
    f.next()
    for line in  f:
        val = line.split()
        y.append(float(val[1]))
        x.append([float(p) for p in val[2:]])
    return x, y


def bsub(r, z):
    """ solves "R b = z", where r is triangular"""
    m, n = matdim(r)
    p, q = matdim(z)
    b = [[0] * n]
    pp, qq = matdim(b)
    for j in range(n-1, -1, -1):
        zz = z[0][j] - sum(r[j][k]*b[0][k] for k in range(j+1, n))
        b[0][j] = zz / r[j][j]
    return b

def linreg(y, x):

    # prepend x with 1
    for xx in x:
        xx.insert(0, 1.0)    

    # QR decomposition
    q, r = qr(x)

    # z = Q^T y
    z = mattmat(q, vec2colmat(y))

    # back substitute to find b in R b = z
    b = bsub(r, transpose(z))
    b = b[0]

    return b


def tester():
    # read test data
    x, y = readdat()

    # calculate coeff
    b = linreg(y, x)

    for i,coef in enumerate(b):
        print 'coef b%d: %f' % (i, coef)

if __name__ == "__main__":
    tester()

Took test data from here: Multiple Regression in Data Mining, which looks like

Case Y X1 X2 X3 X4 X5 X6
  1 43 51 30 39 61 92 45
  2 63 64 51 54 63 73 47
  3 71 70 68 69 76 86 48
  4 61 63 45 47 54 84 35
  5 81 78 56 66 71 83 47
  6 43 55 49 44 54 49 34
  7 58 67 42 56 66 68 35
  8 71 75 50 55 70 66 41
  9 72 82 72 67 71 83 31
 10 67 61 45 47 62 80 41
 11 64 53 53 58 58 67 34
 12 67 60 47 39 59 74 41
 13 69 62 57 42 55 63 25
 14 68 83 83 45 59 77 35
 15 77 77 54 72 79 77 46
 16 81 90 50 72 60 54 36
 17 74 85 64 69 79 79 63
 18 65 60 65 75 55 80 60
 19 65 70 46 57 75 85 46
 20 50 58 68 54 64 78 52

with sample output (NOTE: this is not my output, the example's!!)

Multiple R-squared   0.656
Residual SS        738.900
Std. Dev. Estimate   7.539
         Coefficient StdError t-statistic p-value
Constant      13.182   16.746       0.787   0.445
      X1       0.583    0.232       2.513   0.026
      X2      -0.044    0.167      -0.263   0.797
      X3       0.329    0.219       1.501   0.157
      X4      -0.057    0.317      -0.180   0.860
      X5       0.112    0.196       0.570   0.578
      X6      -0.197    0.247      -0.798   0.439

The code above printed this. Need more flipping textbook to do the stdev etc. but got the number i expected for coeffs.

python linreg.py
coef b0: 13.182283
coef b1: 0.583462
coef b2: -0.043824
coef b3: 0.328782
coef b4: -0.057067
coef b5: 0.111868
coef b6: -0.197083

Upvotes: 6

Related Questions