Reputation: 1173
The problem: I would like to construct a variable that measures cumulative work experience within a person-year longitudinal data set. The problem applies to all sorts of longitudinal data sets and many variables might be constructed in this cumulative way (e.g., number of children, cumulative education, cumulative dollars spend on vacations, etc.)
The case: I have a large longitudinal data set in which every row constitutes a person year. The data set contains thousands of persons (variable “ID”) followed through their lives (variable “age”), resulting in a data frame with about 1.2 million rows. One variable indicates how many months a person has worked in each person year (variable “work”). For example, when Dan was 15 years old he worked 3 months.
ID age work
1 Dan 10 0
2 Dan 11 0
3 Dan 12 0
4 Dan 13 0
5 Dan 14 0
6 Dan 15 3
7 Dan 16 5
8 Dan 17 8
9 Dan 18 5
10 Dan 19 12
11 Jeff 20 0
12 Jeff 16 0
13 Jeff 17 0
14 Jeff 18 0
15 Jeff 19 0
16 Jeff 20 0
17 Jeff 21 8
18 Jeff 22 10
19 Jeff 23 12
20 Jeff 24 12
21 Jeff 25 12
22 Jeff 26 12
23 Jeff 27 12
24 Jeff 28 12
25 Jeff 29 12
I now want to construct a cumulative work experience variable, which adds the value of year x to year x+1. The goal is to know at each age of a person how many months they have worked in their entire carrier. The variable should look like “cumwork”.
ID age work cumwork
1 Dan 10 0 0
2 Dan 11 0 0
3 Dan 12 0 0
4 Dan 13 0 0
5 Dan 14 0 0
6 Dan 15 3 3
7 Dan 16 5 8
8 Dan 17 8 16
9 Dan 18 5 21
10 Dan 19 12 33
11 Jeff 20 0 0
12 Jeff 16 0 0
13 Jeff 17 0 0
14 Jeff 18 0 0
15 Jeff 19 0 0
16 Jeff 20 0 0
17 Jeff 21 8 8
18 Jeff 22 10 18
19 Jeff 23 12 30
20 Jeff 24 12 42
21 Jeff 25 12 54
22 Jeff 26 12 66
23 Jeff 27 12 78
24 Jeff 28 12 90
25 Jeff 29 12 102
A poor solution: I can construct such a cumulative variable using the following simple loop:
# Generate test data set
x=data.frame(ID=c(rep("Dan",times=10),rep("Jeff",times=15)),age=c(10:20,16:29),work=c(rep(0,times=5),3,5,8,5,12,rep(0,times=6),8,10,rep(12,times=7)),stringsAsFactors=F)
# Generate cumulative work experience variable
x$cumwork=x$work
for(r in 2:nrow(x)){
if(x$ID[r]==x$ID[r-1]){
x$cumwork[r]=x$cumwork[r-1]+x$cumwork[r]
}
}
However, my dataset has 1.2 million rows and looping through each row is highly inefficient and running this loop would take hours. Does any brilliant programmer have a suggestion of how to construct this cumulative measure most efficiently?
Many thanks in advance!
Best, Raphael
Upvotes: 3
Views: 347
Reputation: 193637
ave
is convenient for these types of tasks. The function you want to use with it is cumsum
:
x$cumwork <- ave(x$work, x$ID, FUN = cumsum)
x
# ID age work cumwork
# 1 Dan 10 0 0
# 2 Dan 11 0 0
# 3 Dan 12 0 0
# 4 Dan 13 0 0
# 5 Dan 14 0 0
# 6 Dan 15 3 3
# 7 Dan 16 5 8
# 8 Dan 17 8 16
# 9 Dan 18 5 21
# 10 Dan 19 12 33
# 11 Jeff 20 0 0
# 12 Jeff 16 0 0
# 13 Jeff 17 0 0
# 14 Jeff 18 0 0
# 15 Jeff 19 0 0
# 16 Jeff 20 0 0
# 17 Jeff 21 8 8
# 18 Jeff 22 10 18
# 19 Jeff 23 12 30
# 20 Jeff 24 12 42
# 21 Jeff 25 12 54
# 22 Jeff 26 12 66
# 23 Jeff 27 12 78
# 24 Jeff 28 12 90
# 25 Jeff 29 12 102
However, given the scale of your data, I would also strongly suggest the "data.table" package, which also gives you access to convenient syntax:
library(data.table)
DT <- data.table(x)
DT[, cumwork := cumsum(work), by = ID]
Upvotes: 2