Reputation: 678
I have estimated a gravity model for some international flow than uses data on origin countries x*_o
, destination countries x*_d
, and a set ofdistance variables x*
. I now want to see if I can find a more concise model using stepwise model selection. My data looks something like this:
set.seed(450)
data <- data.frame(dep = rnorm(20, 6, 2),
x1_o = rnorm(20, 0, 1),
x1_d = rnorm(20, 5, 3),
x2_o = rnorm(20, 5, 3),
x2_d = rnorm(20, 5, 3),
x3_o = rnorm(20, 5, 3),
x3_d = rnorm(20, 5, 3),
x4 = rnorm(20, 5, 3),
x5 = rnorm(20, 5, 3),
x6 = rnorm(20, 5, 3))
Fit a linear model and do stepwise regression:
lm_fit <- lm(dep ~ ., data = data)
step_fit <- step(lm_fit, direction = "both")
summary(step_fit)
Results:
Call:
lm(formula = dep ~ x1_d + x2_d + x3_o + x3_d + x4 + x6, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.962 -1.003 0.213 0.550 1.955
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4525 1.5384 6.144 3.52e-05 ***
x1_d -0.1615 0.1141 -1.416 0.18039
x2_d -0.8532 0.2105 -4.053 0.00137 **
x3_o -0.1334 0.1011 -1.320 0.20969
x3_d 0.2332 0.1319 1.768 0.10055
x4 0.2830 0.1304 2.170 0.04914 *
x6 -0.1729 0.1123 -1.539 0.14776
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.411 on 13 degrees of freedom
Multiple R-squared: 0.595, Adjusted R-squared: 0.4081
F-statistic: 3.183 on 6 and 13 DF, p-value: 0.0379
As you can see step
dropped the x1
and x2
variable for the origin countries, but kept them for the destination countries. What I want to achieve is that step
always either keeps or drops a variable for both origin and destination countries. For example, x1_o
and x1_d
should be either both in or both out.
Is this possible in R? The scope
argument provides the option to impose some constraints on the model selection, but I'm not sure it is possible to do what I want using that option.
Thanks.
Upvotes: 0
Views: 572
Reputation: 269586
Define each of the paired columns as a factor with nrow(data)
levels and 2 columns equal to the origin and destination. For any such factor, that will force it to either keep both columns or reject both columns. Using the data in the Note at the end (which is the same as in the question except the random seed has been modified so that the answer is a mix of factors and remaining columns:
nr <- nrow(data)
data2 <- transform(data,
x1 = C(factor(1:nr), cbind(x1_o, x1_d), 2),
x2 = C(factor(1:nr), cbind(x2_o, x2_d), 2),
x3 = C(factor(1:nr), cbind(x3_o, x3_d), 2))
fm <- lm(dep ~ x1 + x2 + x3 + x4 + x5 + x6, data2)
dim(model.matrix(fm)) # check dimensions of model matrix
step(fm)
set.seed(13)
data <- data.frame(dep = rnorm(20, 6, 2),
x1_o = rnorm(20, 0, 1),
x1_d = rnorm(20, 5, 3),
x2_o = rnorm(20, 5, 3),
x2_d = rnorm(20, 5, 3),
x3_o = rnorm(20, 5, 3),
x3_d = rnorm(20, 5, 3),
x4 = rnorm(20, 5, 3),
x5 = rnorm(20, 5, 3),
x6 = rnorm(20, 5, 3))
Upvotes: 1