Reputation: 83
I tried to follow this example modify glm... user specificed link function in r but am getting errors. I have binary data, and would like to change the link function from "logit" to a negative exponential link. I want to predict the
probability of success(p) = 1-exp(linear predictor)
The reason I need this link instead of one of the built-in links is that p increases in a convex manner between 0 and 0.5, but the "logit", "cloglog", "probit", and "cauchy" only allow a concave shape. See attached photo for reference: predicted p vs binned observations
Simulate data
location<-as.character(LETTERS[rep(seq(from=1,to=23),30)])
success<-rbinom(n=690, size=1, prob=0.15)
df<-data.frame(location,success)
df$random_var<-rnorm(690,5,3)
df$seedling_size<-abs((0.1+df$success)^(1/df$random_var))
df<-df[order(df$location)]
Create custom link function. Note: eta = linear predictor, mu = probability
negex<-function(){
##link
linkfun<-function(mu) log(-mu+1)
linkinv<-function(eta) 1-exp(eta)
## derivative of inverse link with respect to eta
mu.eta<-function(eta)-exp(eta)
valideta<-function(eta) TRUE
link<-"log(-mu+1)"
structure(list(linkfun=linkfun,linkinv=linkinv,
mu.eta=mu.eta,valideta=valideta,
name=link),
class="link-glm")
}
Model success as a function of seedling size
negexp<-negex()
model1<-glm(success~seedling_size,family=binomial(link=negexp),data=df)
Error: no valid set of coefficients has been found: please supply starting values
Model using glmer (My ultimate goal)
model2<-glmer(success~seedling_size+ (1|location),family=binomial(link=negexp),data=df)
Error in (function (fr, X, reTrms, family, nAGQ = 1L, verbose = 0L, maxit = 100L, : (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
I get different error messages, but I think the problem is the same regardless of whether using glmer or glm, and that is that my link function is wrong somehow.
Upvotes: 3
Views: 591
Reputation: 83
I found the answer. Most helpful was this R thread from 2016. There were 2 issues. First, my link fuction was wrong. I revised it to this:
negex <- function()
{
linkfun <- function(mu) -log(1-mu)
linkinv <- function(eta) 1-exp(-eta)
mu.eta <- function(eta) exp(-eta)
valideta <- function(eta) all(is.finite(eta)&eta>0)
link <- paste0("negexp")
structure(list(linkfun = linkfun, linkinv = linkinv,
mu.eta = mu.eta, valideta = valideta, name = link),
class = "link-glm")
}
Second, the model required specific starting values. These will be unique to your data. Here is the first few lines of the data that I actually found the solution to:
site plot sub_plot oak_success oak_o1_gt05ft..1
0001 10 3 1 0
0001 12 2 0 0
0001 12 3 0 0
0001 12 4 0 0
0001 13 4 0 0
I don't know how to post the full data to this site, but if someone wants it to run the example, shoot me an email: [email protected]
negexp<-negex()
Hopefully this helps someone in the future, because I found no other examples of this being solved on stack overflow or online. Using the new starting values, I was able to get the model to run:
starting_values<-c(1,0) #1 for the intercept and 0 for the slope
h_gt05_solo_negex2<-glm(oak_success~ oak_o1_gt05ft..1 ,
family=binomial(link=negexp),start=starting_values,data=rocdf)
summary(h_gt05_solo_negex2)
Call:
glm(formula = oak_success ~ oak_o1_gt05ft..1, family = binomial(link = negexp),
data = lt40, start = starting_values)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3808 -0.4174 -0.2637 -0.2637 2.5985
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.034774 0.005484 6.341 2.28e-10 ***
oak_o1_gt05ft..1 0.023253 0.002187 10.635 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1416.9 on 2078 degrees of freedom
Residual deviance: 1213.5 on 2077 degrees of freedom
AIC: 1217.5
Number of Fisher Scoring iterations: 6
There were some issues with convergence. As seedling heights (oak_o1_gt05ft..1) got above 40ft, the parameter estimates became unreliable convergence issues. I had very few observations in this range, so I restricted the data to observations were the predictor was <40ft and re-ran the model. I also included "site" (same as "location" in the simulated data)). What you see in this figure are the predictions of oak success with respect to oak seedling heights for each site/location (black circles), the binned observations of successes/samples (large green dots) and the prediction of success probability without a site factor (blue line). It looks like the slope of the seedling size variable is more accurate when site is factored in.
Unfortunately, I was not able to get this model to run in glmer, so site had to be included as a fixed effect, thus, the standard errors and slope estimates for oak seedling height might be a bit conservative.
Upvotes: 3