Reputation: 77
I have once again a question with respect to efficiently conduct some calculation without loops since working with a large data base.
The underlying unbalanced panel data set takes the form of df1 below. As can be seen, there are individuals (ID) observed for different years. Sometimes also with gab years (see ID 4):
library(data.table)
df1 = data.table( Year = c(2000, 2001,2002, 2003, 2000, 2001, 2000,
2001,2002, 2003, 2004, 2001,2003,2004),
ID = c(1,1,1,1,2,2,3,3,3,3,3,4,4,4),
V1 = c(10, 9, 8,11, 2,3, 16,12,15,16,20, 8,7,9),
V2 = c(2,3,2,4,0.5,1, 4,6,5,8,9, 8,10,11)
)
I would like to create a variable V3 in which for each ID V1/V2 is calculated for the first year of observation. For the following years of observation this value is maintained or copied. See below the desired output in df2:
df2 = data.table( Year = c(2000, 2001,2002, 2003, 2000, 2001, 2000,
2001,2002, 2003, 2004,2001,2003,2004),
ID = c(1,1,1,1,2,2,3,3,3,3,3,4,4,4),
V1 = c(10, 9, 8,11, 2,3, 12,16,15,16,20, 8,7,9),
V2 = c(2,3,2,4, 0.5,1, 4,6,5,8,9 ,8,10,11),
V3 = c(5,5,5,5,4,4,3,3,3,3,3, 1,1,1 )
)
Upvotes: 0
Views: 98
Reputation: 51592
You can use first
from data.table
to only do the calculation of the first Year (assuming your data is ordered), i.e.
library(data.table)
df1[, v3 := first(V1)/first(V2), by = ID][]
NOTE: You have some discrepancies with your data (row 8) hence the result is not identical to what you posted
Upvotes: 1