Mohan

Reputation: 8863

Extracting the intercepts for a dependent factor in a model

Suppose I fit a model in R like this:

model = glm(y ~ x + language, family = binomial, data = data)

language is a factor variable; the idea is that there's a different intercept for each language.

Here are the model coefficients:

> coef(model)
  (Intercept)             x  languageen-GB languageen-US    languageja    languageko 
-17.919438297   0.003119914   -0.427067341  -0.613194669   1.406719444   2.402191148 
   languagezh 
  0.894899827 

One level of the language factor (de) has been chosen as a baseline, and (Intercept) gives the intercept for that baseline. languageen-GB, etc., give intercepts as deltas from the baseline intercept.

This code

coeffs = coef(model)
intercepts = c("baseline" = 0, tail(coeffs, -2)) + coeffs["(Intercept)"]
names(intercepts) <- levels(data$language)
intercepts

pulls out the actual intercepts for each factor level:

       de     en-GB     en-US        ja        ko        zh 
-17.91944 -18.34651 -18.53263 -16.51272 -15.51725 -17.02454 

But this is horrendous code. Is there a nicer way of doing this with model methods or package functions?

Edit: one particularly unpleasant part is that the tail(coeffs, -2) will break if you change the formula. I suppose some kind of string search could be used here instead.

Upvotes: 1

Views: 57

Answers (1)

Rui Barradas

Reputation: 76663

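One way to avoid a baseline factor level is to fit the model without an intercept. This is done with a formula such as y ~ 0 + x + language (or y ~ 0 + . when the data frame holds only the model variables); writing - 1 instead of 0 + has the same effect. As long as language is the first (or only) factor in the formula, each of its levels then gets its own coefficient, which is that level's intercept.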

model2 <- glm(y ~ 0 + ., data, family = binomial)
# coef(model2)[1] is the slope for x; the remaining coefficients are the per-level intercepts
intercepts2 <- coef(model2)[-1]
names(intercepts2) <- levels(data$language)
intercepts2
#       de     en-GB     en-US 
#15.846295  8.696764  6.562384 

Now compare with the result posted in the question.

model <- glm(y ~ ., data, family = binomial)

coeffs = coef(model)
intercepts = c("baseline" = 0, tail(coeffs, -2)) + coeffs["(Intercept)"]
names(intercepts) <- levels(data$language)
intercepts
#       de     en-GB     en-US 
#15.846295  8.696764  6.562384 

all.equal(intercepts, intercepts2)
#[1] TRUE

The results are not identical() because the two fits use different parameterizations, so the computations differ; the differences are at the level of floating-point error:

intercepts - intercepts2
#          de        en-GB        en-US 
#3.197442e-14 3.907985e-14 3.552714e-14
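
Selecting the coefficients by name rather than by position keeps the extraction working when terms are added to or reordered in the formula. The following is a minimal sketch, assuming the factor is called language and that no other coefficient name starts with that string. The data frame is rebuilt from iris only so that the block runs on its own, and intercepts3 is an illustrative name.

```r
# Assumed example data, rebuilt compactly from iris so the block is self-contained.
data <- data.frame(
  x = iris$Sepal.Width,
  language = factor(iris$Species, labels = c("de", "en-GB", "en-US")),
  y = as.integer(iris$Sepal.Length < 5.8)
)

# No-intercept fit: every level of language gets its own coefficient.
model2 <- glm(y ~ 0 + language + x, family = binomial, data = data)

cf <- coef(model2)

# Pick the per-level intercepts by name instead of by position.
is_lang <- startsWith(names(cf), "language")
intercepts3 <- cf[is_lang]

# Strip the "language" prefix so the names are the factor levels.
names(intercepts3) <- sub("^language", "", names(intercepts3))
intercepts3
```

Because the selection is by name, it does not matter whether x comes before or after language in the formula.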

Data creation code.

I will adapt the built-in dataset iris as example data.

data <- iris[c(1,2,5)]
data$y <- +(data[[1]] < 5.8)
data <- data[-1]
names(data)[c(1,2)] <- c('x', 'language')
i1 <- data[[2]] == "setosa"
i2 <- data[[2]] == "versicolor"
i3 <- data[[2]] == "virginica"
data[[2]] <- as.character(data[[2]])
data[[2]][i1] <- 'de'
data[[2]][i2] <- 'en-GB'
data[[2]][i3] <- 'en-US'
data[[2]] <- factor(data[[2]])
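
With data of this shape, the per-level intercepts can also be read off the original model (the one with a baseline level) without touching the coefficient vector: an intercept is the linear predictor at x = 0, so predicting on the link scale for one row per level returns it directly. This is a sketch under the assumption that x is the only other predictor; the data frame is rebuilt compactly so the block runs on its own, and newdata, lv, and intercepts4 are illustrative names.

```r
# Assumed example data, equivalent to the iris-based data frame built above.
data <- data.frame(
  x = iris$Sepal.Width,
  language = factor(iris$Species, labels = c("de", "en-GB", "en-US")),
  y = as.integer(iris$Sepal.Length < 5.8)
)

# Model as in the question, with "de" as the baseline level.
model <- glm(y ~ x + language, family = binomial, data = data)

# One row per factor level, with the continuous predictor set to zero.
lv <- levels(data$language)
newdata <- data.frame(x = 0, language = factor(lv, levels = lv))

# type = "link" returns the linear predictor, i.e. intercept + level delta at x = 0.
intercepts4 <- predict(model, newdata = newdata, type = "link")
names(intercepts4) <- lv
intercepts4
```

This does not depend on the order or naming of the coefficients, but any further predictors in the model would also need a reference value in newdata.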

Upvotes: 1
