whi

Reputation: 2750

How to do linear regression in Python with missing elements

I found an example of linear regression:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq

import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
A = np.vstack([x, np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y)[0]  # least-squares fit of y = m*x + c
print(m, c)

My situation is: some elements of y are missing, so x and y are not the same length. Some logic is needed to judge which positions are missing and remove the corresponding entries. Is there a ready-made method for this, or should I write it myself?

e.g.:

x = range(10)
y = [i * 3 + 5 for i in x]
y.pop(3)  # simulate a missing element

I don't know which position is missing. But considering the change in the average slope, position 4 of y is probably the missing one.
This may be a domain-specific question.
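To illustrate what I mean, here is a rough sketch, assuming equally spaced x, exactly one missing value, and a hand-picked 1.5 factor:

y = [i * 3 + 5 for i in range(10)]
y.pop(3)  # simulate the missing element

diffs = [b - a for a, b in zip(y, y[1:])]  # successive differences
typical = sorted(diffs)[len(diffs) // 2]   # median step size
# the gap roughly doubles where an element was dropped
guess = next(i + 1 for i, d in enumerate(diffs) if d > 1.5 * typical)
print(guess)  # 3: the element at original index 3 of y was dropped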

Upvotes: 3

Views: 4314

Answers (3)

Pierre GM

Reputation: 20339

I'm afraid you're going to run into trouble with the way you create missing values:

y = [i * 3 + 5 for i in x]
y.pop(3)  # make a missing element

You specifically chose to make the element at index 3 missing, but what happens now? How are you supposed to tell your script that, in fact, the element originally at index 3 is missing?

I would suggest flagging your missing values as np.nan (provided your values are all floats, of course). Then finding which values are missing is easy:

missing = np.isnan(y)  # boolean mask; requires y to be a float ndarray

Now you can remove the entries of A and y where y is missing, i.e. where y is np.nan:

Anew = A[~missing]
ynew = y[~missing]

m, c = np.linalg.lstsq(Anew, ynew)[0]
print(m, c)

(The ~ operator turns True into False and vice versa: you are selecting the entries where y is not np.nan.)
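Putting the pieces together, a minimal self-contained version (the np.nan at index 2 is just example data):

import numpy as np

x = np.array([0, 1, 2, 3], dtype=float)
y = np.array([-1, 0.2, np.nan, 2.1])  # missing value flagged as np.nan

A = np.vstack([x, np.ones(len(x))]).T
missing = np.isnan(y)

m, c = np.linalg.lstsq(A[~missing], y[~missing])[0]
print(m, c)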

If your y values are actually integers, that won't work, as np.nan exists only for floats. In that case, you can use the np.ma (masked array) module:

my = np.ma.masked_array(y)
my[3] = np.ma.masked  # flag the entry at index 3 as missing

Anew = A[~my.mask]       # keep only the rows where y is not masked
ynew = my.compressed()   # the unmasked values of y

m, c = np.linalg.lstsq(Anew, ynew)[0]
print(m, c)
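As a side note (my addition, not from the original answer): if np.ma.polyfit behaves as documented and skips masked entries on its own, the whole fit can be a single call, along these lines:

import numpy as np

x = np.arange(10, dtype=float)
my = np.ma.masked_array(3 * x + 5)
my[3] = np.ma.masked  # flag the missing element

m, c = np.ma.polyfit(x, my, 1)  # degree-1 fit that ignores masked entries
print(m, c)  # roughly 3.0 and 5.0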

Upvotes: 4

whi

Reputation: 2750

I have a rough solution below:

def slope(X, Y, i):
    # slope of the line through point 0 and point i
    return (Y[i] - Y[0]) * 1.0 / (X[i] - X[0])

len_thold = 0.2  # length threshold (unused in this sketch)

def notgood(lst1, lst2):
    # too few points to estimate a slope (unused in this sketch)
    return len(lst1) < 2 or len(lst2) < 2

def adjust_miss(X, Y):
    # drop entries from the longer list wherever the running slope
    # deviates too much from the overall rough slope
    slope_thold = 1.1
    if len(X) == len(Y):
        return
    newlen = min(len(X), len(Y))
    if len(Y) - len(X) < 0:
        aim = X  # Y is shorter, so remove entries from X
    else:
        aim = Y
    difflen = abs(len(Y) - len(X))
    roughk = slope(X, Y, newlen - 1)  # overall slope estimate
    for i in range(1, newlen):
        if difflen == 0:
            break
        k = slope(X, Y, i)
        if (len(Y) < len(X) and k > slope_thold * roughk) or \
           (len(Y) > len(X) and k < 1.0 / (slope_thold * roughk)):
            aim.pop(i)
            difflen -= 1
    if difflen > 0:
        # could not locate all the gaps; trim from the tail instead
        for i in range(difflen):
            aim.pop(-1)
    assert len(X) == len(Y)

def test_adjust():
    X = list(range(10))
    Y = list(range(10))
    Y.pop(3)  # make a missing element
    adjust_miss(X, Y)
    print(X, Y)  # expect X == Y == [0, 1, 2, 4, 5, 6, 7, 8, 9]

Upvotes: 0

user1149913

Reputation: 4523

I am assuming that you know which of the x's are associated with missing elements of y.

In this case, you have a transductive learning problem, because you want to estimate values of y for known positions of x.

In the probabilistic linear regression formulation, learning a distribution p(y|x), it turns out that there is no difference between the transductive solution and the answer you get by just running regression after removing the x's with no associated y's.

So the answer is: just remove the x's with no associated y's and run linear regression on the reduced problem.
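A minimal sketch of that reduction (the known_missing indices here are hypothetical example data):

import numpy as np

x = np.arange(10, dtype=float)
y_full = 3 * x + 5
known_missing = [3]  # hypothetical: positions of x with no observed y

keep = np.ones(len(x), dtype=bool)
keep[known_missing] = False  # drop the x's with no associated y

A = np.vstack([x[keep], np.ones(keep.sum())]).T
m, c = np.linalg.lstsq(A, y_full[keep])[0]
print(m, c)  # recovers slope 3 and intercept 5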

Upvotes: 1
