Mace
Mace

Reputation: 1269

Doing several cross sectional regression of unbalanced panel data in R

I have an unbalanced panel data set. The following data will do for illustration:

Id <- c(rep(1:4,3),rep(5,2))
Id <- Id[order(Id)]
Year <- c(rep(2000:2002,4),c(2000,2002))

z1 <- rnorm(14)
z2 <- rnorm(14)
z3 <- rnorm(14)
z4 <- rnorm(14)

CORR <- rbind(c(1,0.6,0.5,0.2),c(0.6,1,0.7,0.3),c(0.5,0.7,1,0.4),c(0.2,0.3,0.4,1))
CholCORR <- chol(CORR)
DataTest <- as.data.frame(cbind(z1,z2,z3,z4)%*%CholCORR)
names(DataTest)<-c("y","x1","x2","x3")
DataTest <- cbind(Id, Year, DataTest)
DataTest

   Id Year          y         x1         x2         x3
1   1 2000 -0.7463355 -1.1920928 -1.2358912 -0.2527170
2   1 2001 -0.3475260 -0.1729497 -0.6252036  0.4366446
3   1 2002  0.6815678  0.1775869  0.2860103  0.8479373
4   2 2000  0.7927199  1.2830142  1.7018747  2.4475020
5   2 2001  0.5157535  0.4365100  1.1512340  1.0882581
6   2 2002 -1.2806121  0.3392759  0.5781835  0.2829615
7   3 2000  1.8741119  0.3904028  0.7984584 -1.7015025
8   3 2001 -0.4799960  0.6397883  0.4719150  0.8601328
9   3 2002 -2.2031991  0.3789198 -0.1143526  0.6771387
10  4 2000 -1.0634857 -0.6171644 -0.5732400  0.7718195
11  4 2001  0.3266059 -0.6882776 -0.2013544 -0.7242561
12  4 2002  0.2921222 -0.5149802  0.6180026 -0.4417939
13  5 2000 -2.2447755 -1.3328675 -1.6370130  0.4537451
14  5 2002 -0.9959034 -0.5958327 -0.3408927  0.2162799

I want to do one cross section linear regression for each year (3 regressions) and save the estimated coefficients, but it's complicated by the fact that the panel is unbalanced (Id 5 is missing an observation for Year 2001) and because I want to estimate the following dynamic formula with lag and diff:

formula(diff(y) ~ lag(x1) + x2 + x3)

I have thought of using the plm, reshape, or plyr package, but I can't find an efficient to do it when I want to use lag and diff with my unbalanced panel.

Thanks,

M

Upvotes: 1

Views: 1767

Answers (1)

Mace
Mace

Reputation: 1269

I found a solution that is pretty efficient. It uses split and lapply as suggested by statquant above, but also uses the plm package in order to compute the difference of an unbalanced panel data set.

library(plm)

Id <- c(rep(1:4,3),rep(5,2))
Id <- Id[order(Id)]
Year <- c(rep(2000:2002,4),c(2000,2002))

z1 <- rnorm(14)
z2 <- rnorm(14)
z3 <- rnorm(14)
z4 <- rnorm(14)

CORR <- rbind(c(1,0.6,0.5,0.2),c(0.6,1,0.7,0.3),c(0.5,0.7,1,0.4),c(0.2,0.3,0.4,1))
CholCORR <- chol(CORR)
DataTest <- as.data.frame(cbind(z1,z2,z3,z4)%*%CholCORR)
names(DataTest)<-c("y","x1","x2","x3")
DataTest <- cbind(Id, Year, DataTest)
DataTest

       Id Year          y         x1         x2          x3   
1-2000  1 2000 -0.3837477  0.3065426  1.0646871  0.23757223   
1-2001  1 2001  1.2804333 -0.2015468 -0.2769726 -0.63032551  
1-2002  1 2002 -0.3242049 -1.3518821 -0.4720256  0.08556161 
2-2000  2 2000 -0.3298273 -0.4354473  0.3588493  0.80121465        
2-2001  2 2001 -0.5556866  1.1987959  1.6196555  1.28593473 
2-2002  2 2002  0.2861269  0.2921481  1.1051309  1.66204274  
3-2000  3 2000  0.9224208  0.4255198  0.8947040  1.11784735         
3-2001  3 2001 -1.1052755 -1.7078627 -1.9503432 -2.82343057 
3-2002  3 2002 -2.3020849 -0.8078460 -0.2692165  0.64940791 
4-2000  4 2000  2.5565427  1.7034472  2.2688046  1.71329610         
4-2001  4 2001  0.7015629  0.8518173 -0.2490498  0.70293713 
4-2002  4 2002  0.9454791  0.8830710  0.7355299  0.51836849  
5-2000  5 2000  0.4867604  0.3301825  0.7929939  0.55197991        
5-2002  5 2002 -0.5859263 -0.1164737  0.5831942 -0.03307241 

pDataTest <- pdata.frame(DataTest, index = c('Id','Year'))


pDataTest$yDiff <- diff(pDataTest$y)
pDataTest

       Id Year          y         x1         x2          x3      yDiff
1-2000  1 2000 -0.3837477  0.3065426  1.0646871  0.23757223         NA
1-2001  1 2001  1.2804333 -0.2015468 -0.2769726 -0.63032551  1.6641810
1-2002  1 2002 -0.3242049 -1.3518821 -0.4720256  0.08556161 -1.6046382
2-2000  2 2000 -0.3298273 -0.4354473  0.3588493  0.80121465         NA
2-2001  2 2001 -0.5556866  1.1987959  1.6196555  1.28593473 -0.2258592
2-2002  2 2002  0.2861269  0.2921481  1.1051309  1.66204274  0.8418135
3-2000  3 2000  0.9224208  0.4255198  0.8947040  1.11784735         NA
3-2001  3 2001 -1.1052755 -1.7078627 -1.9503432 -2.82343057 -2.0276963
3-2002  3 2002 -2.3020849 -0.8078460 -0.2692165  0.64940791 -1.1968093
4-2000  4 2000  2.5565427  1.7034472  2.2688046  1.71329610         NA
4-2001  4 2001  0.7015629  0.8518173 -0.2490498  0.70293713 -1.8549798
4-2002  4 2002  0.9454791  0.8830710  0.7355299  0.51836849  0.2439162
5-2000  5 2000  0.4867604  0.3301825  0.7929939  0.55197991         NA
5-2002  5 2002 -0.5859263 -0.1164737  0.5831942 -0.03307241         NA

Temp <-split(pDataTest,'Year')

MyFormula <- formula(yDiff ~ lag(x1) + x2 + x3)
Fit <- lapply(Temp[-1], lm, formula=MyFormula)
Fit

$`2001`

Call:
FUN(formula = ..1, data = X[[1L]])

Coefficients:
(Intercept)      lag(x1)           x2           x3  
      15.27       -52.88        -1.53        39.17  


$`2002`

Call:
FUN(formula = ..1, data = X[[2L]])

Coefficients:
(Intercept)      lag(x1)           x2           x3  
    -0.8738       0.1038       1.2805       0.1626  

Upvotes: 2

Related Questions