rwolst
rwolst

Reputation: 13682

Handling Unknown Factor Levels in R GLM

I have a dataset contained in dataframe daf which I am splitting into training and test data based on the date e.g. train on dates below 20090000 and test on dates above. To do this, we split the original dataframe into daf_train and daf_test.

I am using GLM and have a factor in the model daf$city. The issue that is arising is that daf_test sometimes contains a new city that was not seen in daf_train.

I am thinking the best way around this is to do something like

levels(daf_train$city) = levels(daf$city)

to prewarn it about all possible cities.

I would then like the GLM to recognise that for cities that have not been seen before, take an average of the factor coefficients for cities. If all the previous factors' coefficients had mean zero I think this would be good enough.

How would I alter the code to do this

mylogit = glm(Y ~ X + factor(city), data=daf_train, family=binomial(link='logit'))
predictions = predict(mylogit, daf_test, type='response')

Note, a really ugly and non general way to do this (I am also fairly new to R so maybe this will also mess with the GLM object) is

cityLevels = levels(factor(daf$city))
daf_train$city = factor(daf_train$city, cityLevels)

# daf_train$city now has all the levels of the overall dataset 
# But if we train a GLM now, it will ignore any levels without observations

# Instead we split the factor into binary variables
train_data = cbind(daf_train, model.matrix( ~ 0 + city, daf_train))
# Remove the factor variable
train_data$city = NULL

# Now train the GLM
mylogit = glm(Y ~., data = train_data, family=binomial(link='logit'))

# This gives us coefficient values for all factors in the training set
# Any factors not in the training set get coefficient values of NA

# Finally we must convert the factor coefficients to have zero mean
offset = mean(mylogit$coefficients[-1:-34])
mylogit$coefficients[-1:-34] = mylogit$coefficients[-1:-34] - offset
mylogit$coefficients[1] = mylogit$coefficients[1] + offset

# Yeuch, this required us to know where in our coefficients vector our cities started (34)

Upvotes: 1

Views: 1070

Answers (1)

Whitebeard
Whitebeard

Reputation: 6203

I think this is what you are looking for. Still pretty ugly, but I think it's a bit more general than your code. Let me know if it misses the mark and I can modify or delete.

# dummy data
set.seed(321)
daf_train <- data.frame(x = runif(100, min=10, max=50), 
                        y = runif(100), 
                        city = sample(c("city1", "city2", "city3"), size=100, replace=TRUE))

set.seed(321)
daf_test <- data.frame(x = runif(30, min=10, max=50),
                       y = runif(30),
                       city = sample(c("city1", "city2", "city3", "city4"), size=30, replace=TRUE))

daf_train$city <- factor(daf_train$city, levels=levels(daf_test$city))


# cities in test set but not train set
(newcity <- sort(unique(daf_test$city))[!sort(unique(daf_test$city)) %in% unique(daf_train$city)])
[1] city4
Levels: city1 city2 city3 city4

# fit model with city1, city2, city3
xreg <- cbind(x=daf_train$x, model.matrix(~ 0 + city, data=daf_train))

mylogit = glm(y ~ xreg, data=daf_train, family=binomial(link='logit'))

newxreg <- cbind(x=daf_test$x, model.matrix(~ 0 + city, data=daf_test))

# mean of city coefficients
if (length(newcity) > 0) {

  # coefficients from model
  citycoef <- coef(mylogit)[grepl("city", names(coef(mylogit)))]

  # calculate coefficient for new city(cities)
  citycoef_offset <- mean(citycoef, na.rm=TRUE)

  # repeat for all new cities
  citycoef[is.na(citycoef)] <- citycoef_offset 

  # center coefficients
  citycoef <- scale(citycoef, center=TRUE, scale=FALSE)[, 1]

  # replace city coefficients from model
  modelcoef <- coef(mylogit)

  # add offset to intercept
  modelcoef[["(Intercept)"]] <- modelcoef[["(Intercept)"]] + citycoef_offset

  # all new coefficients
  modelcoef[match(names(citycoef), names(modelcoef))] <- citycoef

  # Beta0 + Beta1x...
  pcoef <- modelcoef[["(Intercept)"]] + 
    newxreg %*%
    modelcoef[!names(modelcoef) == "(Intercept)"]

  #predicted response
  predictions <- unlist(lapply(pcoef, function(x) exp(x) / (1 + exp(x))))


} else {
  predictions <- predict(mylogit, daf_test, type="response")
}

Upvotes: 1

Related Questions