Reputation: 65
I am having a seemingly simple but very frustrating problem. When you run a model with an interaction term in R, R names the parameter generated "var1:var2" etc. Unfortunately, this naming convention prevents me from calculating predicted values and CI's where newdata is required, because ":" is not a character that can be included in a column header, and the names in the original data frame must exactly match those in newdata. Has anyone else had this problem?
Here is a sample of my code:
wemedist2.exp = glm(survive/trials ~ sitedist + type + sitedist*type + roaddist, family = binomial(logexp(wemedata$expos)), data=wemedata)
summary(wemedist2.exp)
wemepredict3 = with(wemedata, data.frame(sitedist=mean(sitedist),roaddist=mean(roaddist), type=factor(1:2)))
wemepredict3 = cbind(wemepredict3, predict(wemedist2.exp, newdata = wemepredict3, type = "link", se = TRUE))
This produces a table with predicted values for each of the variables at the specified levels, but not interaction.
Upvotes: 3
Views: 1386
Reputation: 27388
For your newdata
data frame, you shouldn't include columns for the interactions. The product of the interactive variables will be calculated for you (and multiplied by the estimated coefficient) when calling predict
.
For example:
Create some dummy data:
set.seed(1)
n <- 10000
X <- data.frame(x1=runif(n), x2=runif(n))
X$x1x2 <- X$x1 * X$x2
head(X)
# x1 x2 x1x2
# 1 0.2655087 0.06471249 0.017181728
# 2 0.3721239 0.67661240 0.251783646
# 3 0.5728534 0.73537169 0.421260147
# 4 0.9082078 0.11129967 0.101083225
# 5 0.2016819 0.04665462 0.009409393
# 6 0.8983897 0.13091031 0.117608474
b <- runif(4)
y <- b[1] + c(as.matrix(X) %*% b[-1]) + rnorm(n, sd=0.1)
Fit the model and compare the estimated vs. true coefficients:
M <- lm(y ~ x1 * x2, X)
summary(M)
# Call:
# lm(formula = y ~ x1 * x2, data = X)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.43208 -0.06743 -0.00170 0.06601 0.37197
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.202040 0.003906 51.72 <2e-16 ***
# x1 0.128237 0.006809 18.83 <2e-16 ***
# x2 0.156942 0.006763 23.21 <2e-16 ***
# x1:x2 0.292582 0.011773 24.85 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.09906 on 9996 degrees of freedom
# Multiple R-squared: 0.5997, Adjusted R-squared: 0.5996
# F-statistic: 4992 on 3 and 9996 DF, p-value: < 2.2e-16
b
# [1] 0.2106027 0.1147864 0.1453641 0.3099322
Create example data to predict to, and do prediction. Note that we only create x1
and x2
, and do not create x1:x2
:
X.predict <- data.frame(x1=runif(10), x2=runif(10))
head(X.predict)
# x1 x2
# 1 0.26037592 0.7652155
# 2 0.73988333 0.3352932
# 3 0.02650689 0.9788743
# 4 0.84083874 0.1446228
# 5 0.85052685 0.7674547
# 6 0.13568509 0.9612156
predict(M, newdata=X.predict)
# 1 2 3 4 5 6 7
# 0.4138194 0.4221251 0.3666572 0.3681432 0.6225354 0.4084543 0.4711018
# 8 9 10
# 0.7092744 0.3401867 0.2320834
Or...
An alternative approach is to include the interactions in your model-fitting data by calculating the product of the interactive terms, and then include this in your new data as well. We've done the first step in point 1 above, where we created a column called x1x2
.
Then we would fit the model with: lm(y ~ x1 + x2 + x1x2, X)
And predict to the following data:
X.predict <- data.frame(x1=runif(10), x2=runif(10), x1x2=runif(10)
If you have categorical variables involved in interactions...
When you have interactions involving categorical variables, the model estimates coefficients describing the effect of belonging to each level relative to belonging to a reference level. So for instance if we have one continuous predictor (x1
) and one categorical predictor (x2
, with levels a
, b
, and c
), then the model y ~ x1 * x2
will estimate six coefficients, describing:
the intercept (i.e. the predicted y
when x1
is zero and the observation belongs to the reference level of x2
);
the effect of varying x1
when the observation belongs to the reference level of x2
(i.e. the slope, for the reference level of x2
);
the effect of belonging to the second level (i.e. the change in intercept due to belonging to the second level, relative to belonging to the reference level);
the effect of belonging to the third level (i.e. the change in intercept due to belonging to the third level, relative to belonging to the reference level);
the change in the effect of x1
(i.e. change in slope) due to belonging to the second level, relative to belonging to the reference level; and
the change in the effect of x1
(i.e. change in slope) due to belonging to the third level, relative to belonging to the reference level.
If you want to fit and predict the model with/to pre-calculated data describing the interaction, you can create a dataframe that includes columns: x1
; x2b
(binary, indicating whether the observation belongs to level b
); x2c
(binary, indicating whether the observation belongs to level c
); x1x2b
(the product of x1
and x2b
); and x1x2c
(the product of x1
and x2c
).
A quick way to do this is with model.matrix
:
set.seed(1)
n <- 1000
d <- data.frame(x1=runif(n), x2=sample(letters[1:3], n, replace=TRUE))
head(d)
# x1 x2
# 1 0.2655087 b
# 2 0.3721239 c
# 3 0.5728534 b
# 4 0.9082078 c
# 5 0.2016819 a
# 6 0.8983897 a
X <- model.matrix(~x1*x2, d)
head(X)
# (Intercept) x1 x2b x2c x1:x2b x1:x2c
# 1 1 0.2655087 1 0 0.2655087 0.0000000
# 2 1 0.3721239 0 1 0.0000000 0.3721239
# 3 1 0.5728534 1 0 0.5728534 0.0000000
# 4 1 0.9082078 0 1 0.0000000 0.9082078
# 5 1 0.2016819 0 0 0.0000000 0.0000000
# 6 1 0.8983897 0 0 0.0000000 0.0000000
b <- rnorm(6) # coefficients
y <- X %*% b + rnorm(n, sd=0.1)
You can rename the columns of X
to whatever you want, as long as you use consistent naming when predict
ing the model to new data later.
Now fit the model. Here I tell lm
not to calculate an intercept (with -1
), since the variable (Intercept)
already exists in X
and will have a coefficient calculated for it. We could have also done this by fitting to data as.data.frame(X[, -1])
:
(M <- lm(y ~ . - 1, as.data.frame(X)))
# Call:
# lm(formula = y ~ . - 1, data = as.data.frame(X))
#
# Coefficients:
# `(Intercept)` x1 x2b x2c `x1:x2b` `x1:x2c`
# 1.14389 1.09168 -0.88879 0.20405 0.09085 -1.63769
Create some new data to predict to, and carry out the prediction:
d.predict <- expand.grid(x1=seq(0, 1, 0.1), x2=letters[1:3])
X.predict <- model.matrix(~x1*x2, d.predict)
y.predict <- predict(M, as.data.frame(X.predict))
Upvotes: 3