Reputation: 9213
I started running into the error (converted from warning):
glm.fit (or glm.fit2): fitted probabilities numerically 0 or 1 occurred
I found this link referencing linear separation of data:
[R] glm.fit: "fitted probabilities numerically 0 or 1 occurr
So I tried hunting through the data and found a small reproducible example from a small subset of the data (both glm and glm2) where I don't actually see the linear separation and yet I get the error:
response = c(0,1,0,1,0,0,0,0,0,0)
dependent = c(133,571,1401,4930,3134075,44357054,1718619387,1884020779,8970035092,9392823637)
foo = data.frame(y=response,x=dependent)
glm(y ~ x, family=binomial, data=foo)
I can avoid the issue by transforming the dependent via log(x+1). However, this is monotonic and doesn't alter the ordering, so I'm not sure why it helps or whether I should be doing it. The dependents are "microseconds since the last time some event happened", which is why some values can be large. I tried turning it into a two-level factor (recent, not recent), but that loses information and underperforms the raw values.
Upvotes: 1
Views: 1526
Reputation: 263451
It's not an error and your claim that it was labeled an error by the system is misleading. It was a warning and clearly labeled as such. Plot your data first, then answer the question: What would be your estimate for the probability when the "dependent"-variable was above 1e+09?
If your answer is different than zero, I think you need to explain why that is so.
fit <- glm(y ~ x, family=binomial, data=foo); xs <- seq(0, 1e10, length=100)  # fit was not defined before
png(); plot(response~dependent); lines(xs, predict(fit, data.frame(x=xs), type="response"), col="red"); dev.off()
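If you'd rather see a number than a plot, here is a minimal sketch that asks the fitted model for its predicted probability at x = 1e9. The data are the ones from the question; the name `p_large` and the use of `suppressWarnings()` are my own additions for illustration, not part of the original post.

```r
# Data copied from the question
response  <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
dependent <- c(133, 571, 1401, 4930, 3134075, 44357054, 1718619387,
               1884020779, 8970035092, 9392823637)
foo <- data.frame(y = response, x = dependent)

# suppressWarnings() only keeps the demo quiet; the fit itself is unchanged
fit <- suppressWarnings(glm(y ~ x, family = binomial, data = foo))

# Predicted probability of y == 1 at x = 1e9 (p_large is a hypothetical name)
p_large <- predict(fit, newdata = data.frame(x = 1e9), type = "response")
print(p_large)
```

The printed value should be indistinguishable from zero, which is what the warning is reporting.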
Upvotes: 1
Reputation: 6913
I think this is just a feature of the data and the rounding of the floating point calculations going on in the optimization of the maximum likelihood function.
Take a look at the fitted values of the log transformed set:
> response = c(0,1,0,1,0,0,0,0,0,0)
> dependent = c(133,571,1401,4930,3134075,44357054,1718619387,1884020779,8970035092,9392823637)
>
> foo = data.frame(y=response,x=log(dependent))
> mlog <- glm(y ~ x, family=binomial, data=foo)
> mlog$fitted
1 2 3 4
0.584089292 0.484155299 0.422713978 0.340825478
5 6 7 8
0.079815887 0.040011202 0.014931996 0.014562755
9 10
0.009506656 0.009387457
Whereas the untransformed set results in minuscule fitted values:
> foo = data.frame(y=response,x=dependent)
> m <- glm(y ~ x, family=binomial, data=foo)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> m$fitted.values
1 2 3
5.007959e-01 5.005387e-01 5.000511e-01
4 5 6
4.979784e-01 6.359085e-04 2.220446e-16
7 8 9
2.220446e-16 2.220446e-16 2.220446e-16
10
2.220446e-16
This doesn't seem to be a warning related to complete (or quasi-complete) separation of the data. I think the warning is pretty informative in this case.
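As a rough check of the floating point explanation, the sketch below compares the fitted values with `.Machine$double.eps`, which is about 2.22e-16 for IEEE doubles and matches the repeated 2.220446e-16 entries above. The cutoff of `10 * eps` and the name `at_floor` are assumptions I picked for illustration, so treat this as a sketch rather than a description of what glm.fit does internally.

```r
# Data copied from the question
response  <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
dependent <- c(133, 571, 1401, 4930, 3134075, 44357054, 1718619387,
               1884020779, 8970035092, 9392823637)
foo <- data.frame(y = response, x = dependent)

# suppressWarnings() only keeps the demo quiet; the fit itself is unchanged
m <- suppressWarnings(glm(y ~ x, family = binomial, data = foo))

eps <- .Machine$double.eps            # machine epsilon for doubles
at_floor <- fitted(m) < 10 * eps      # assumed cutoff, for illustration only

print(eps)
print(which(at_floor))                # observations whose fit is numerically 0

# Linear predictor (logit scale): hugely negative values here are what
# push the fitted probabilities down to the numerical floor
print(range(predict(m, type = "link")))
```

On the raw microsecond scale the linear predictor spans an enormous range, so the large x values end up with probabilities that are numerically zero. On the log scale the same observations stay well inside (0, 1).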
Upvotes: 2