Handling Unknown Factor Levels in R GLM

Question

I have a dataset in a data frame daf, which I am splitting into training and test data based on the date, e.g. training on dates below 20090000 and testing on dates above. To do this, I split the original data frame into daf_train and daf_test.

I am fitting a GLM with a factor in the model, daf$city. The problem is that daf_test sometimes contains a city that was not seen in daf_train.

I am thinking the best way around this is to do something like

levels(daf_train$city) = levels(daf$city)

to prewarn it about all possible cities.
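A caveat on that idea, as a small sketch with made-up city names (train_city and all_cities are hypothetical): if the training factor currently has fewer levels than the full set, assigning to levels() relabels the existing levels by position rather than adding new ones. Rebuilding the factor with factor(x, levels = ...) keeps every value and only extends the level set. If daf_train was subset from a daf whose city column was already a factor, the levels are carried over and the assignment changes nothing.

```r
all_cities <- c("Leeds", "London", "York")
train_city <- factor(c("London", "York", "London"))  # levels: London, York

# Assigning a longer vector to levels() renames by position:
# "London" silently becomes "Leeds" and "York" becomes "London"
relabelled <- train_city
levels(relabelled) <- all_cities

# Rebuilding the factor keeps the values and adds the missing level
extended <- factor(train_city, levels = all_cities)

print(as.character(relabelled))
print(as.character(extended))
print(table(extended))
```

Either way, glm() will not estimate a coefficient for a level that has no training observations, so extending the levels alone does not solve the prediction problem.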

I would then like the GLM, for cities it has not seen before, to use the average of the fitted city coefficients. If the fitted city coefficients had mean zero, I think treating an unseen city as having a coefficient of zero would be good enough.

How would I alter the code below to do this?

mylogit = glm(Y ~ X + factor(city), data=daf_train, family=binomial(link='logit'))
predictions = predict(mylogit, daf_test, type='response')
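For illustration, here is one possible sketch of the behaviour described above, not a definitive implementation. It assumes sum-to-zero contrasts (contr.sum), under which the implied city effects average to zero across the training cities (an unweighted average over cities, not over observations). Because predict() stops with an error on new factor levels, the linear predictor is built by hand and the city columns are zeroed for unseen cities. The toy data and the names test_known, design, city_term and city_cols are invented for the example.

```r
# Toy data standing in for daf (all values are made up for illustration)
set.seed(1)
daf <- data.frame(
  date = rep(c(20080101, 20090101), times = c(300, 100)),
  X    = rnorm(400),
  city = c(rep(c("A", "B", "C"), 100), rep(c("A", "B", "C", "D"), 25)),
  stringsAsFactors = FALSE
)
daf$Y <- rbinom(400, 1, plogis(0.5 * daf$X))
daf_train <- daf[daf$date < 20090000, ]
daf_test  <- daf[daf$date > 20090000, ]

# Sum-to-zero contrasts: the implied city effects average to zero
daf_train$city <- factor(daf_train$city)
mylogit <- glm(Y ~ X + city, data = daf_train,
               family = binomial(link = "logit"),
               contrasts = list(city = "contr.sum"))

# Temporarily map unseen cities to a known level so the design matrix builds
train_levels <- levels(daf_train$city)
unseen <- !(daf_test$city %in% train_levels)
test_known <- daf_test
test_known$city <- factor(
  ifelse(unseen, train_levels[1], as.character(daf_test$city)),
  levels = train_levels
)
design <- model.matrix(delete.response(terms(mylogit)), test_known,
                       contrasts.arg = mylogit$contrasts)

# Zero the city columns for unseen cities: they get the average (zero) effect
city_term <- which(attr(terms(mylogit), "term.labels") == "city")
city_cols <- which(attr(design, "assign") == city_term)
design[unseen, city_cols] <- 0

predictions <- plogis(drop(design %*% coef(mylogit)))
```

For rows with known cities this should agree with predict(mylogit, ..., type = 'response'); for unseen cities the prediction depends only on the intercept and X.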

Note: a really ugly and non-general way to do this is below (I am also fairly new to R, so maybe this will also mess with the GLM object):

cityLevels = levels(factor(daf$city))
daf_train$city = factor(daf_train$city, cityLevels)

# daf_train$city now has all the levels of the overall dataset 
# But if we train a GLM now, it will ignore any levels without observations

# Instead we split the factor into binary variables
train_data = cbind(daf_train, model.matrix( ~ 0 + city, daf_train))
# Remove the factor variable
train_data$city = NULL

# Now train the GLM
mylogit = glm(Y ~., data = train_data, family=binomial(link='logit'))

# This gives us coefficient values for the cities in the training set
# Any cities not in the training set get coefficient values of NA

# Finally we must convert the city coefficients to have zero mean
# (na.rm = TRUE, otherwise the NA coefficients make the mean NA)
city_idx = -(1:34)  # drop the first 34 coefficients, leaving the city terms
offset = mean(mylogit$coefficients[city_idx], na.rm = TRUE)
mylogit$coefficients[city_idx] = mylogit$coefficients[city_idx] - offset
mylogit$coefficients[1] = mylogit$coefficients[1] + offset

# Yeuch, this required us to know how many coefficients come before our cities (34)
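The hard-coded position could probably be avoided by selecting coefficients by name, since model.matrix(~ 0 + city, ...) prefixes every dummy column with "city". The sketch below uses toy data, and the names train, b, is_city, seen and city_eff are hypothetical. It also assumes that a city which was seen in training but still has an NA coefficient was aliased with the intercept (it acts as the reference level), so its effect is treated as 0 before centring. The recentred coefficients are kept in a separate vector so the glm object is left untouched.

```r
set.seed(42)
city <- rep(c("A", "B", "C"), each = 50)
# Level "D" exists in the overall data but never appears in training
train <- data.frame(X = rnorm(150),
                    city = factor(city, levels = c("A", "B", "C", "D")))
train$Y <- rbinom(150, 1,
                  plogis(0.3 * train$X + c(A = -0.5, B = 0, C = 0.5)[city]))

train_data <- cbind(train, model.matrix(~ 0 + city, train))
train_data$city <- NULL
mylogit <- glm(Y ~ ., data = train_data, family = binomial(link = "logit"))

b <- coef(mylogit)
is_city <- grepl("^city", names(b))                 # no hard-coded position
seen <- colSums(train_data[names(b)[is_city]]) > 0  # cities with training rows

city_eff <- b[is_city]
city_eff[seen & is.na(city_eff)] <- 0  # aliased reference city: effect 0
offset <- mean(city_eff[seen])
city_eff[seen] <- city_eff[seen] - offset  # seen cities now average to zero
city_eff[!seen] <- 0                       # unseen cities get the average

b[is_city] <- city_eff
b["(Intercept)"] <- b["(Intercept)"] + offset
```

Predictions can then be computed by hand as plogis(design %*% b), where design is a matrix with a column of ones followed by columns in the same order as names(b).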
