Reputation: 2750
I found an example of linear regression:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq
import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
A = np.vstack([x, np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y)[0]
print(m, c)
My situation is: some elements of y are missing, so x and y are not the same length. It needs some intelligence to judge which position is missing and remove it. Is there a method at hand, or should I do it myself?
e.g.:
x = range(10)
y = [i * 3 + 5 for i in x]
y.pop(3)  # make a value missing
I don't know which position is missing, but judging by the average change in slope, possibly position 4 of y (index 3) is the missing one.
This may be a question for a specialized domain.
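To illustrate what I mean, a rough sketch (assuming the trend is close to linear; the helper function is just my guess at an approach, not a library routine):
import numpy as np

def guess_missing_index(y):
    # Guess which position of y was dropped, assuming y is roughly linear:
    # the first consecutive difference much larger than the median step marks the gap.
    d = np.diff(y)
    step = np.median(d)
    jumps = np.where(d > 1.5 * step)[0]
    return jumps[0] + 1 if len(jumps) else None

x = range(10)
y = [i * 3 + 5 for i in x]
y.pop(3)                       # make a value missing
print(guess_missing_index(y))  # -> 3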
Upvotes: 3
Views: 4314
Reputation: 20339
I'm afraid you're going to get into trouble with your way of making missing values:
y=[i*3+5 for i in x]
y.pop(3) #make a missing
You specifically want to make the element at index 3 missing, but what happens now? How are you supposed to tell your script that, in fact, that element is missing?
I would suggest flagging your missing values as np.nan (provided they're all floats, of course). Then, finding which values are missing is easy:
missing = np.isnan(y)
Now, you can remove the entries of x and A where y is missing, i.e. where y is np.nan:
y = np.asarray(y)  # boolean indexing needs an array, not a list
Anew = A[~missing]
ynew = y[~missing]
m, c = np.linalg.lstsq(Anew, ynew)[0]
print(m, c)
(The ~ operator turns your True into False and vice versa: you're selecting the entries where y is not np.nan.)
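Putting those pieces together, a self-contained sketch of the np.nan approach (the sample data here is made up for illustration):
import numpy as np

x = np.arange(10, dtype=float)
y = np.array([i * 3 + 5 for i in x])
y[3] = np.nan                  # flag the missing value instead of popping it

A = np.vstack([x, np.ones(len(x))]).T
missing = np.isnan(y)
m, c = np.linalg.lstsq(A[~missing], y[~missing])[0]
print(m, c)                    # ~3.0, ~5.0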
If your y are actually integers, that won't work, as np.nan is only for floats. You could then use the np.ma module:
my = np.ma.masked_array(y)  # mask the missing entries instead of removing them
my[3] = np.ma.masked
Anew = A[~my.mask]
ynew = my.compressed()  # keeps only the unmasked values
m, c = np.linalg.lstsq(Anew, ynew)[0]
print(m, c)
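Note that this assumes y still has its full length, with a placeholder at the masked position: if the element had already been popped, A and my.mask would no longer line up.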
Upvotes: 4
Reputation: 2750
I have a rough solution below:
def slope(X, Y, i):
    # slope of the line from point 0 to point i
    res = (Y[i] - Y[0]) * 1.0 / (X[i] - X[0])
    return res

len_thold = 0.2

def notgood(lst1, lst2):
    if len(lst1) < 2 or len(lst2) < 2:
        return True
    return False

def adjust_miss(X, Y):
    slope_thold = 1.1
    if len(X) == len(Y):
        return
    newlen = min(len(X), len(Y))
    if len(Y) - len(X) < 0:
        aim = X  # Y is shorter: drop entries from X
    else:
        aim = Y  # X is shorter: drop entries from Y
    difflen = abs(len(Y) - len(X))
    roughk = slope(X, Y, newlen - 1)
    for i in range(1, newlen):
        if difflen == 0:
            break
        k = slope(X, Y, i)
        if (len(Y) < len(X) and k > slope_thold * roughk) or \
           (len(Y) > len(X) and k < 1.0 / (slope_thold * roughk)):
            aim.pop(i)
            difflen -= 1
    if difflen > 0:
        for i in range(difflen):
            aim.pop(-1)
    assert len(X) == len(Y)

def test_adjust():
    X = list(range(10))
    Y = list(range(10))
    Y.pop(3)
    adjust_miss(X, Y)
    print(X, Y)
Upvotes: 0
Reputation: 4523
I am assuming that you know which of the x's are associated with missing elements of y.
In this case, you have a transductive learning problem, because you want to estimate values of y for known positions of x.
In the probabilistic linear regression formulation, learning a distribution p(y|x), it turns out that there is no difference between the transductive solution and the answer you get by just running regression after removing the x's with no associated y's.
So the answer is - just remove the x's with no associated y's and run linear regression on the reduced problem.
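For instance, assuming the missing entries of y are flagged with None at known positions (my own convention for this sketch), the reduced regression could look like:
import numpy as np

x = list(range(10))
y = [i * 3 + 5 for i in x]
y[3] = None                    # a known-missing position

keep = [i for i, yi in enumerate(y) if yi is not None]
xk = np.array([x[i] for i in keep], dtype=float)
yk = np.array([y[i] for i in keep], dtype=float)

A = np.vstack([xk, np.ones(len(xk))]).T
m, c = np.linalg.lstsq(A, yk)[0]
print(m, c)                    # ~3.0, ~5.0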
Upvotes: 1