Reputation: 3199
I am trying to implement linear regression with multiple variables (actually, just 2). I am using the data from the Stanford ML class. I got it working correctly for the single-variable case. The same code should work for multiple variables, but it does not.
LINK to the data :
http://s3.amazonaws.com/mlclass-resources/exercises/mlclass-ex1.zip
Feature Normalization:
''' This is for the regression-with-multiple-variables problem. You have to normalize the features before doing anything. Let's get started. '''
from __future__ import division
import os,sys
from math import *

def mean(f,col):
    #Find the mean of a feature (column col) in file f
    sigma = 0
    count = 0
    data = open(f,'r')
    for line in data:
        points = line.split(",")
        sigma = sigma + float(points[col].strip("\n"))
        count+=1
    data.close()
    return sigma/count

def size(f):
    #Count the number of data points (lines) in file f
    count = 0
    data = open(f,'r')
    for line in data:
        count +=1
    data.close()
    return count

def standard_dev(f,col):
    #Calculate the standard deviation. Formula: sqrt( sum((x - mean)**2) / N )
    data = open(f,'r')
    sigma = 0
    mean = 0
    if(col==0):
        mean = mean_area
    else:
        mean = mean_bedroom
    for line in data:
        points = line.split(",")
        sigma = sigma + (float(points[col].strip("\n")) - mean) ** 2
    data.close()
    return sqrt(sigma/SIZE)

def substitute(f,fnew):
    ''' Take the old file.
    1. Subtract the mean value of each feature
    2. Scale it by dividing by the SD
    '''
    data = open(f,'r')
    data_new = open(fnew,'w')
    for line in data:
        points = line.split(",")
        new_area = (float(points[0]) - mean_area ) / sd_area
        new_bedroom = (float(points[1].strip("\n")) - mean_bedroom) / sd_bedroom
        data_new.write("1,"+str(new_area)+ ","+str(new_bedroom)+","+str(points[2].strip("\n"))+"\n")
    data.close()
    data_new.close()

global mean_area
global mean_bedroom
mean_bedroom = mean(sys.argv[1],1)
mean_area = mean(sys.argv[1],0)
print 'Mean number of bedrooms',mean_bedroom
print 'Mean area',mean_area
global SIZE
SIZE = size(sys.argv[1])
global sd_area
global sd_bedroom
sd_area = standard_dev(sys.argv[1],0)
sd_bedroom=standard_dev(sys.argv[1],1)
substitute(sys.argv[1],sys.argv[2])
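As a quick sanity check on the hand-rolled statistics, the same column means and standard deviations can be computed with NumPy (a minimal sketch; it assumes NumPy is installed and that the raw data file is passed as the first command-line argument, as above):

import sys
import numpy

# Load the raw comma-separated data (area, bedrooms, price) into an array.
raw = numpy.loadtxt(sys.argv[1], delimiter=',')

# Population mean and std per column, matching sqrt( sum((x - mean)**2) / N ) above.
print(raw.mean(axis=0))
print(raw.std(axis=0))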
I implemented the mean and standard deviation in the code myself, instead of using NumPy/SciPy. After storing the normalized values in a file, a snapshot of it looks like this:
X1 X2 X3 COST OF HOUSE
1,0.131415422021,-0.226093367578,399900
1,-0.509640697591,-0.226093367578,329900
1,0.507908698618,-0.226093367578,369000
1,-0.743677058719,-1.5543919021,232000
1,1.27107074578,1.10220516694,539900
1,-0.0199450506651,1.10220516694,299900
1,-0.593588522778,-0.226093367578,314900
1,-0.729685754521,-0.226093367578,198999
1,-0.789466781548,-0.226093367578,212000
1,-0.644465992588,-0.226093367578,242500
I run regression on it to find the parameters. The code for that is below:
''' The plan is to rewrite this and, this time, calculate the cost on every pass to make sure it is decreasing. Also make it general enough to handle multiple variables. '''
from __future__ import division
import os,sys

def computecost(X,Y,theta):
    #X is the feature vector, Y is the predicted variable
    h_theta=calculatehTheta(X,theta)
    delta = (h_theta - Y) * (h_theta - Y)
    return (1/194) * delta

def allCost(f,no_features):
    theta=[0,0]
    sigma=0
    data = open(f,'r')
    for line in data:
        X=[]
        Y=0
        points=line.split(",")
        for i in range(no_features):
            X.append(float(points[i]))
        Y=float(points[no_features].strip("\n"))
        sigma=sigma+computecost(X,Y,theta)
    return sigma

def calculatehTheta(points,theta):
    #This takes a data row of the form (1, feature1, feature2, and so on)
    #print 'Points are',points
    sigma = 0
    for i in range(len(theta)):
        sigma = sigma + theta[i] * float(points[i])
    return sigma

def gradient_Descent(f,no_iters,no_features,theta):
    ''' Calculate ( h(x) - y ) * xj(i) and subtract it from thetaj. Continue for 1500 iterations and you will have your answer. '''
    X=[]
    Y=0
    sigma=0
    alpha=0.01
    for i in range(no_iters):
        for j in range(len(theta)):
            data = open(f,'r')
            for line in data:
                points=line.split(",")
                for i in range(no_features):
                    X.append(float(points[i]))
                Y=float(points[no_features].strip("\n"))
                h_theta = calculatehTheta(points,theta)
                delta = h_theta - Y
                sigma = sigma + delta * float(points[j])
            data.close()
            theta[j] = theta[j] - (alpha/97) * sigma
            sigma = 0
    print theta
    print allCost(sys.argv[1],2)

print gradient_Descent(sys.argv[1],1500,2,[0,0,0])
It prints the following as the parameters:
[-3.8697149722857996e-14, 0.02030369056348706, 0.979706406501678]
All three are horribly wrong :( The exact same code works with a single variable.
Thanks!
Upvotes: 1
Views: 1889
Reputation: 6367
The global variables and the quadruply nested loops worry me. That, and reading and writing the data to files multiple times.
Is your data so big it doesn't easily fit in memory?
Why not use the csv module for the file processing?
Why not use NumPy for the numeric part?
Don't reinvent the wheel.
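A minimal sketch of the csv route for the file handling (the module is in the standard library; the file name here is just a placeholder):

import csv

# Read every comma-separated row into a list of floats.
with open('ex1data2.txt') as fh:          # placeholder file name
    rows = [[float(x) for x in row] for row in csv.reader(fh)]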
Assuming your data entries are rows, you can normalize your data and do a least squares fit in two lines:
normData = (data-data.mean(axis = 0))/data.std(axis = 0)
c = numpy.dot(numpy.linalg.pinv(normData),prices)
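For completeness, a minimal end-to-end sketch of that approach (assumptions: NumPy is available, the raw file is ex1data2.txt from the zip with columns area, bedrooms, price, and the intercept column of ones is appended after normalizing, since normalizing a constant column would divide by zero):

import numpy

# Load the raw data: columns are area, bedrooms, price (file name is an assumption).
raw = numpy.loadtxt('ex1data2.txt', delimiter=',')
data, prices = raw[:, :2], raw[:, 2]

# Normalize each feature: subtract the column mean, divide by the column std.
normData = (data - data.mean(axis=0)) / data.std(axis=0)

# Prepend the column of ones for the intercept, then solve the least-squares problem.
X = numpy.hstack([numpy.ones((len(normData), 1)), normData])
c = numpy.dot(numpy.linalg.pinv(X), prices)
print(c)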
Reply to comment from Original Poster:
OK, then the only other advice I can give you is to break it up into smaller pieces, so that it's easier to see what's going on and easier to sanity-check the small parts.
It's probably not the problem, but you are using i as the index for two of the loops in that quadruple loop. That's the sort of problem you can avoid by cutting it into smaller scopes.
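A stripped-down illustration of that index clash (hypothetical numbers, not the original code):

for i in range(3):        # meant to count outer iterations: 0, 1, 2
    for i in range(2):    # reuses the same name, so it overwrites the outer counter
        pass
    print(i)              # prints 1 on every pass, not 0, 1, 2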
I think it's been years since I wrote an explicitly nested loop, or declared a global variable.
Upvotes: 2