Reputation: 711
I have a dataset (data frame) with 5 columns all containing numeric values.
I'm looking to run a simple linear regression for each pair in the dataset.
For example, If the columns were named A, B, C, D, E
, I want to run lm(A~B), lm(A~C), lm(A~D), ...., lm(D~E)
,... and, then I want to plot the data for each pair along with the regression line.
I'm pretty new to R so I'm sort of spinning my wheels on how to actually accomplish this. Should I use ddply
? or lapply
? I'm not really sure how to tackle this.
Upvotes: 6
Views: 10328
Reputation: 2302
I would recommend to also look at the correlation matrix (cor(DF)
), which is usually the best way to discover linear relationships between variables. The correlation is tightly linked to the covariance and the slopes of a simple linear regression. The computation below exemplifies this link.
Sample data:
set.seed(1)
DF <- data.frame(
A=rnorm(50, 100, 3),
B=rnorm(50, 100, 3),
C=rnorm(50, 100, 3),
D=rnorm(50, 100, 3),
E=rnorm(50, 100, 3)
)
The regression slope is cov(x, y) / var(x)
beta = cov(DF) * (1/diag(var(DF)))
A B C D E
A 1.00000000 -0.045548503 0.028448192 -0.32982367 0.01800795
B -0.03354243 1.000000000 0.003298708 -0.02489518 0.04501362
C 0.02429041 0.003824755 1.000000000 0.24269838 0.15550116
D -0.22407592 -0.022967212 0.193107904 1.00000000 -0.08977834
E 0.01038445 0.035248685 0.105020194 -0.07620397 1.00000000
The intercept is mean(y) - beta * mean(x)
colMeans(DF) - beta * colMeans(DF)
A B C D E
A 1.421085e-14 104.86992 97.44795 133.38310 98.49512
B 1.037180e+02 0.00000 100.02095 102.85026 95.83477
C 9.712461e+01 99.16182 0.00000 75.38373 84.06356
D 1.226899e+02 102.53263 80.87529 0.00000 109.22915
E 9.886859e+01 96.38451 89.41391 107.51930 0.00000
Upvotes: 2
Reputation: 61214
Here's one solution using combn
combn(names(DF), 2, function(x){lm(DF[, x])}, simplify = FALSE)
Example:
set.seed(1)
DF <- data.frame(A=rnorm(50, 100, 3),
B=rnorm(50, 100, 3),
C=rnorm(50, 100, 3),
D=rnorm(50, 100, 3),
E=rnorm(50, 100, 3))
Updated: adding @Henrik suggestion (see comments)
# only the coefficients
> results <- combn(names(DF), 2, function(x){coefficients(lm(DF[, x]))}, simplify = FALSE)
> vars <- combn(names(DF), 2)
> names(results) <- vars[1 , ] # adding names to identify variables in the reggression
> results
$A
(Intercept) B
103.66739418 -0.03354243
$A
(Intercept) C
97.88341555 0.02429041
$A
(Intercept) D
122.7606103 -0.2240759
$A
(Intercept) E
99.26387487 0.01038445
$B
(Intercept) C
99.971253525 0.003824755
$B
(Intercept) D
102.65399702 -0.02296721
$B
(Intercept) E
96.83042199 0.03524868
$C
(Intercept) D
80.1872211 0.1931079
$C
(Intercept) E
89.0503893 0.1050202
$D
(Intercept) E
107.84384655 -0.07620397
Upvotes: 7
Reputation: 15458
Using combn
for all combination of names of column (in the following example I assumed you want combination of two columns only) and Map
for running over loops.
Example using mtcars data from R:
colc<-names(mtcars)
colcc<-combn(colc,2)
colcc<-data.frame(colcc)
kk<-Map(function(x)lm(as.formula(paste(colcc[1,x],"~",paste(colcc[2,x],collapse="+"))),data=mtcars), as.list(1:nrow(colcc)))
head(kk)
[[1]]
Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2,
x], collapse = "+"))), data = mtcars)
Coefficients:
(Intercept) cyl
37.885 -2.876
[[2]]
Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2,
x], collapse = "+"))), data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
[[3]]
Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2,
x], collapse = "+"))), data = mtcars)
Coefficients:
(Intercept) hp
30.09886 -0.06823
[[4]]
Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2,
x], collapse = "+"))), data = mtcars)
Coefficients:
(Intercept) drat
-7.525 7.678
[[5]]
Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2,
x], collapse = "+"))), data = mtcars)
Coefficients:
(Intercept) wt
37.285 -5.344
[[6]]
Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2,
x], collapse = "+"))), data = mtcars)
Coefficients:
(Intercept) qsec
-5.114 1.412
Upvotes: 1