user1642513
user1642513

Reputation:

run regression by groups and assign the fitted values and/or residuals back into original data.frame?

I am trying to get group-wise predictions from ols and assign the predictions back to a column in the original data.frame. In sas, this can be done by:

proc reg data = ds;
by group_var1 group_var2;
model y= x;
output out = ds r = resd p = pred;
run;

The code above will correctly assign the predictions to the pred variable and residuals to the resd variable in the correct "block" in the original data set. In R, the closest I have come to is something like the following:

d <- data.frame(x = rnorm(20), y = rnorm(20), g1 = c(rep('a', 10), rep('b', 10)), g2 = rep(c(rep('c', 5), rep('d', 5)), 2))
fun <- function(d) predict(lm(y ~ x, data = d), d)
d['predict'] <- unlist(by(d, d[, c('g1', 'g2')], fun))

             x           y g1 g2     predict
1  -0.53089730  0.26437295  a  c -0.01569909
2  -1.70298591 -0.58804638  a  c -0.01960800
3   0.31134574 -0.96575392  a  c -0.01289022
4   1.03821508  0.36409789  a  c -0.01046612
5  -1.05180195  0.84922972  a  c -0.01743631
6   0.94785058  0.16559779  a  d  0.55257659
7  -0.11779401 -2.31900972  a  d  0.65193420
8  -0.87618526 -2.29891776  a  d -0.54113668
9   0.28450262 -0.68698073  a  d -1.48699542
10  0.44388469 -1.54596297  a  d -1.22664063
11 -1.15656711  1.70991300  b  c -0.26021797
12 -1.26949128 -0.05968582  b  c -1.67447336
13  0.08648475 -1.56257791  b  c -2.68096176
14  1.16149361 -1.40203666  b  c -1.14057100
15  0.86558927 -0.73587454  b  c -0.92904930
16 -1.15168500  0.55817377  b  d  1.34287585
17  1.17898623 -0.84767449  b  d  0.12289709
18 -0.61372747  1.55786932  b  d  1.06128454
19 -0.31233192  1.15423216  b  d  0.90352047
20  0.61869842  1.42415426  b  d  0.41617707

However, I feel this approach is not robust because unlist simply flattens the output without any way to match the groups. I have definitely seen cases where the predictions get mis-allocated to other groups. Is there a clever and robust way to achieve this?

Upvotes: 4

Views: 1158

Answers (1)

agstudy
agstudy

Reputation: 121568

You have many option in R , for example using plyr

library(plyr)
ddply(d,.(g1,g2),fun)
 g1 g2          1           2          3          4          5
1  a  c  0.4218236 -0.20147871  0.4544890 -0.2330185  0.2070659
2  a  d  1.0728608  1.08735907  1.0239526  0.8740484  1.4750465
3  b  c -0.1288420  0.20727212  0.2407177  0.4017079 -0.5237893
4  b  d -0.6598871 -0.06777359 -0.2847552 -1.6400449 -0.2821023

Or using data.table

library(data.table)
DT <- as.data.table(d)
DT[,as.list(predict(lm(y~x))),by= c('g1', 'g2')]

  g1 g2          1           2          3          4          5
1:  a  c  0.4218236 -0.20147871  0.4544890 -0.2330185  0.2070659
2:  a  d  1.0728608  1.08735907  1.0239526  0.8740484  1.4750465
3:  b  c -0.1288420  0.20727212  0.2407177  0.4017079 -0.5237893
4:  b  d -0.6598871 -0.06777359 -0.2847552 -1.6400449 -0.2821023

Upvotes: 3

Related Questions