Reputation: 13682
I have a dataset contained in dataframe daf
which I am splitting into training and test data based on the date e.g. train on dates below 20090000
and test on dates above. To do this, we split the original dataframe into daf_train
and daf_test
.
I am using GLM
and have a factor in the model daf$city
. The issue that is arising is that daf_test
sometimes contains a new city that was not seen in daf_train
.
I am thinking the best way around this is to do something like
levels(daf_train$city) = levels(daf$city)
to prewarn it about all possible cities.
I would then like the GLM
to recognise that for cities that have not been seen before, take an average of the factor coefficients for cities. If all the previous factors' coefficients had mean zero I think this would be good enough.
How would I alter the code to do this
mylogit = glm(Y ~ X + factor(city), data=daf_train, family=binomial(link='logit'))
predictions = predict(mylogit, daf_test, type='response')
Note, a really ugly and non general way to do this (I am also fairly new to R so maybe this will also mess with the GLM object) is
cityLevels = levels(factor(daf$city))
daf_train$city = factor(daf_train$city, cityLevels)
# daf_train$city now has all the levels of the overall dataset
# But if we train a GLM now, it will ignore any levels without observations
# Instead we split the factor into binary variables
train_data = cbind(daf_train, model.matrix( ~ 0 + city, daf_train))
# Remove the factor variable
train_data$city = NULL
# Now train the GLM
mylogit = glm(Y ~., data = train_data, family=binomial(link='logit'))
# This gives us coefficient values for all factors in the training set
# Any factors not in the training set get coefficient values of NA
# Finally we must convert the factor coefficients to have zero mean
offset = mean(mylogit$coefficients[-1:-34])
mylogit$coefficients[-1:-34] = mylogit$coefficients[-1:-34] - offset
mylogit$coefficients[1] = mylogit$coefficients[1] + offset
# Yeuch, this required us to know where in our coefficients vector our cities started (34)
Upvotes: 1
Views: 1070
Reputation: 6203
I think this is what you are looking for. Still pretty ugly, but I think it's a bit more general than your code. Let me know if it misses the mark and I can modify or delete.
# dummy data
set.seed(321)
daf_train <- data.frame(x = runif(100, min=10, max=50),
y = runif(100),
city = sample(c("city1", "city2", "city3"), size=100, replace=TRUE))
set.seed(321)
daf_test <- data.frame(x = runif(30, min=10, max=50),
y = runif(30),
city = sample(c("city1", "city2", "city3", "city4"), size=30, replace=TRUE))
daf_train$city <- factor(daf_train$city, levels=levels(daf_test$city))
# cities in test set but not train set
(newcity <- sort(unique(daf_test$city))[!sort(unique(daf_test$city)) %in% unique(daf_train$city)])
[1] city4
Levels: city1 city2 city3 city4
# fit model with city1, city2, city3
xreg <- cbind(x=daf_train$x, model.matrix(~ 0 + city, data=daf_train))
mylogit = glm(y ~ xreg, data=daf_train, family=binomial(link='logit'))
newxreg <- cbind(x=daf_test$x, model.matrix(~ 0 + city, data=daf_test))
# mean of city coefficients
if (length(newcity) > 0) {
# coefficients from model
citycoef <- coef(mylogit)[grepl("city", names(coef(mylogit)))]
# calculate coefficient for new city(cities)
citycoef_offset <- mean(citycoef, na.rm=TRUE)
# repeat for all new cities
citycoef[is.na(citycoef)] <- citycoef_offset
# center coefficients
citycoef <- scale(citycoef, center=TRUE, scale=FALSE)[, 1]
# replace city coefficients from model
modelcoef <- coef(mylogit)
# add offset to intercept
modelcoef[["(Intercept)"]] <- modelcoef[["(Intercept)"]] + citycoef_offset
# all new coefficients
modelcoef[match(names(citycoef), names(modelcoef))] <- citycoef
# Beta0 + Beta1x...
pcoef <- modelcoef[["(Intercept)"]] +
newxreg %*%
modelcoef[!names(modelcoef) == "(Intercept)"]
#predicted response
predictions <- unlist(lapply(pcoef, function(x) exp(x) / (1 + exp(x))))
} else {
predictions <- predict(mylogit, daf_test, type="response")
}
Upvotes: 1