Palace Chan

Reputation: 9213

How come I get this logistic regression error in glm/glm2 if I don't exhibit linear separation in my data?

I started running into the error (converted from warning):

glm.fit (or glm.fit2): fitted probabilities numerically 0 or 1 occurred

I found this link referencing linear separation of data:

[R] glm.fit: "fitted probabilities numerically 0 or 1 occurr

So I tried hunting through the data and found a small reproducible example from a small subset of the data (both glm and glm2) where I don't actually see the linear separation and yet I get the error:

response = c(0,1,0,1,0,0,0,0,0,0)
dependent = c(133,571,1401,4930,3134075,44357054,1718619387,1884020779,8970035092,9392823637)
foo = data.frame(y=response,x=dependent)
glm(y ~ x, family=binomial, data=foo)

I can avoid the issue by transforming the dependent via log(x+1). However, this is monotonic and doesn't alter the ordering, so I'm not sure why it helps or whether I should be doing it. The dependents are "microseconds since the last time some event happened", which is why some values can be large. I tried turning it into a two-level factor (recent, not recent), but that loses information and underperforms the raw values.

Upvotes: 1

Views: 1526

Answers (2)

IRTFM

Reputation: 263451

It's not an error, and your claim that it was labeled an error by the system is misleading. It was a warning and clearly labeled as such. Plot your data first, then answer the question: What would be your estimate for the probability when the "dependent"-variable was above 1e+09?

If your answer is different than zero, I think you need to explain why that is so.

 fit <- glm(y ~ x, family=binomial, data=foo)  # the model from the question
 xs <- seq(0, 1e10, length=100)                # grid for the fitted curve
 png(); plot(response~dependent)
 lines(xs, predict(fit, list(x=xs), type="response"), col="red"); dev.off()
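
To put a number on that question, one could ask the fitted model directly. This is only a minimal sketch, assuming the same data as in the question; the evaluation points 1e3, 1e6 and 1e9 are arbitrary illustrative choices, and suppressWarnings() is used only to keep the output quiet.

```r
# Data copied from the question
response  <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
dependent <- c(133, 571, 1401, 4930, 3134075, 44357054,
               1718619387, 1884020779, 8970035092, 9392823637)
foo <- data.frame(y = response, x = dependent)

# Silencing the warning does not change the fitted model
fit <- suppressWarnings(glm(y ~ x, family = binomial, data = foo))

# Predicted probability of y == 1 at a few illustrative x values
p <- predict(fit, newdata = data.frame(x = c(1e3, 1e6, 1e9)),
             type = "response")
print(p)
```

The last value is the model's own answer for x = 1e9, and it should be indistinguishable from zero, which is exactly what the warning is reporting.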


Upvotes: 1

mlegge

Reputation: 6913

I think this is just a feature of the data and the rounding of the floating point calculations going on in the optimization of the maximum likelihood function.

Take a look at the fitted values of the log transformed set:

> response = c(0,1,0,1,0,0,0,0,0,0)
> dependent = c(133,571,1401,4930,3134075,44357054,1718619387,1884020779,8970035092,9392823637)
> 
> foo = data.frame(y=response,x=log(dependent))
> mlog <- glm(y ~ x, family=binomial, data=foo)
> mlog$fitted
          1           2           3           4 
0.584089292 0.484155299 0.422713978 0.340825478 
          5           6           7           8 
0.079815887 0.040011202 0.014931996 0.014562755 
          9          10 
0.009506656 0.009387457 

Whereas the untransformed set produces minuscule fitted values, several of them at 2.220446e-16, which is .Machine$double.eps:

> foo = data.frame(y=response,x=dependent)
> m <- glm(y ~ x, family=binomial, data=foo)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> m$fitted.values
           1            2            3 
5.007959e-01 5.005387e-01 5.000511e-01 
           4            5            6 
4.979784e-01 6.359085e-04 2.220446e-16 
           7            8            9 
2.220446e-16 2.220446e-16 2.220446e-16 
          10 
2.220446e-16 

This doesn't seem to be a warning related to complete (or quasi-complete) separation of the data. I think the warning is pretty informative in this case: it tells you that some fitted probabilities have been driven to numerical zero.
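
If it helps, here is a rough way to see which observations trip the warning. It is a sketch, assuming (from my reading of the stats source for glm.fit) that the check for binomial fits compares the fitted values against 10 * .Machine$double.eps; the names eps, near_zero and near_one are just illustrative.

```r
# Data copied from the question
response  <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
dependent <- c(133, 571, 1401, 4930, 3134075, 44357054,
               1718619387, 1884020779, 8970035092, 9392823637)
foo <- data.frame(y = response, x = dependent)

# Fit on the raw scale; the warning is suppressed only to keep output clean
m <- suppressWarnings(glm(y ~ x, family = binomial, data = foo))

# Assumed threshold: fitted values within eps of 0 or 1 count as "numerically 0 or 1"
eps <- 10 * .Machine$double.eps
near_zero <- m$fitted.values < eps
near_one  <- m$fitted.values > 1 - eps

# Rows whose fitted probability is effectively 0 or 1
print(foo[near_zero | near_one, ])
```

In the fitted values printed above, those are the rows with the very large x values. Because the logit is linear in x, a predictor spanning roughly eight orders of magnitude pushes the linear predictor far into the negative tail for those rows, whereas log(x) compresses the range and keeps the fitted probabilities away from zero.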

Upvotes: 2
