Reputation:
I am trying to get group-wise predictions from ols and assign the predictions back to a column in the original data.frame. In sas, this can be done by:
proc reg data = ds;
by group_var1 group_var2;
model y= x;
output out = ds r = resd p = pred;
run;
The code above will correctly assign the predictions to the pred
variable and residuals to the resd
variable in the correct "block" in the original data set. In R, the closest I have come to is something like the following:
d <- data.frame(x = rnorm(20), y = rnorm(20), g1 = c(rep('a', 10), rep('b', 10)), g2 = rep(c(rep('c', 5), rep('d', 5)), 2))
fun <- function(d) predict(lm(y ~ x, data = d), d)
d['predict'] <- unlist(by(d, d[, c('g1', 'g2')], fun))
x y g1 g2 predict
1 -0.53089730 0.26437295 a c -0.01569909
2 -1.70298591 -0.58804638 a c -0.01960800
3 0.31134574 -0.96575392 a c -0.01289022
4 1.03821508 0.36409789 a c -0.01046612
5 -1.05180195 0.84922972 a c -0.01743631
6 0.94785058 0.16559779 a d 0.55257659
7 -0.11779401 -2.31900972 a d 0.65193420
8 -0.87618526 -2.29891776 a d -0.54113668
9 0.28450262 -0.68698073 a d -1.48699542
10 0.44388469 -1.54596297 a d -1.22664063
11 -1.15656711 1.70991300 b c -0.26021797
12 -1.26949128 -0.05968582 b c -1.67447336
13 0.08648475 -1.56257791 b c -2.68096176
14 1.16149361 -1.40203666 b c -1.14057100
15 0.86558927 -0.73587454 b c -0.92904930
16 -1.15168500 0.55817377 b d 1.34287585
17 1.17898623 -0.84767449 b d 0.12289709
18 -0.61372747 1.55786932 b d 1.06128454
19 -0.31233192 1.15423216 b d 0.90352047
20 0.61869842 1.42415426 b d 0.41617707
However, I feel this approach is not robust because unlist
simply flattens the output without any way to match the groups. I have definitely seen cases where the predictions get mis-allocated to other groups. Is there a clever and robust way to achieve this?
Upvotes: 4
Views: 1158
Reputation: 121568
You have many option in R , for example using plyr
library(plyr)
ddply(d,.(g1,g2),fun)
g1 g2 1 2 3 4 5
1 a c 0.4218236 -0.20147871 0.4544890 -0.2330185 0.2070659
2 a d 1.0728608 1.08735907 1.0239526 0.8740484 1.4750465
3 b c -0.1288420 0.20727212 0.2407177 0.4017079 -0.5237893
4 b d -0.6598871 -0.06777359 -0.2847552 -1.6400449 -0.2821023
Or using data.table
library(data.table)
DT <- as.data.table(d)
DT[,as.list(predict(lm(y~x))),by= c('g1', 'g2')]
g1 g2 1 2 3 4 5
1: a c 0.4218236 -0.20147871 0.4544890 -0.2330185 0.2070659
2: a d 1.0728608 1.08735907 1.0239526 0.8740484 1.4750465
3: b c -0.1288420 0.20727212 0.2407177 0.4017079 -0.5237893
4: b d -0.6598871 -0.06777359 -0.2847552 -1.6400449 -0.2821023
Upvotes: 3