Reputation: 185
I have a data frame x
like this:
ID year month vol sum_vol
1 2000 1 1 6
1 2000 2 2 6
1 2000 3 3 6
1 2001 3 4 25
1 2001 4 5 25
1 2001 5 16 25
2 2000 1 7 24
2 2000 2 8 24
2 2000 3 9 24
2 2001 3 12 35
2 2001 4 11 35
2 2001 5 12 35
3 2000 1 13 42
3 2000 2 14 42
3 2000 3 15 42
3 2001 6 16 44
3 2001 7 10 44
3 2001 8 18 44
And the desired output:
ID year month vol sum_vol lag_year_sum_vol lag_2_month_vol
1 2000 1 1 6 NA NA
1 2000 2 2 6 NA NA
1 2000 3 3 6 NA 1
1 2001 3 4 25 6 NA
1 2001 4 5 25 6 NA
1 2001 5 16 25 6 4
2 2000 1 7 24 NA NA
2 2000 2 8 24 NA NA
2 2000 3 9 24 NA 7
2 2001 3 12 35 24 NA
2 2001 4 11 35 24 NA
2 2001 5 12 35 24 12
3 2000 1 13 42 NA NA
3 2000 2 14 42 NA NA
3 2000 3 15 42 NA 13
3 2001 6 16 44 42 NA
3 2001 7 10 44 42 NA
3 2001 8 18 44 42 16
I have searched a lot and not getting a conclusion:
So as you can see my questions are:
1) How to create a variable lag_year_sum_vol for one year per ID, which has a value of last year's sum_vol?
2) How to have lagged new variables lag_2_month_vol per customer per year, by 2 months?
NOTE: The ID, year and month in the real data may not be in this sorted order. And actually for any arbitrary number for year, month, and vol -> no pattern in the data.
I prefer a method in either dplyr or data.table. (It seems data.table is more concise.)
Thanks in advance!!
Upvotes: 0
Views: 58
Reputation: 70286
Here's an option using data.table:
library(data.table)
# column 1
dt[dt[, .(ID, year = year +1, sum_vol)], on = .(ID, year),
lag_year_sum_vol := i.sum_vol]
# column 2
dt[dt[, .(ID, year, month = month+2, vol)], on = .(ID, year, month),
lag_2_month_vol := i.vol]
As you can see, I temporarily modified the data, joined and updated the original data in both cases. There are, of course, other ways to do this using data.table.
The result is:
ID year month vol sum_vol lag_year_sum_vol lag_2_month_vol
1: 1 2000 1 1 6 NA NA
2: 1 2000 2 2 6 NA NA
3: 1 2000 3 3 6 NA 1
4: 1 2001 3 4 25 6 NA
5: 1 2001 4 5 25 6 NA
6: 1 2001 5 16 25 6 4
7: 2 2000 1 7 24 NA NA
8: 2 2000 2 8 24 NA NA
9: 2 2000 3 9 24 NA 7
10: 2 2001 3 12 35 24 NA
11: 2 2001 4 11 35 24 NA
12: 2 2001 5 12 35 24 12
13: 3 2000 1 13 42 NA NA
14: 3 2000 2 14 42 NA NA
15: 3 2000 3 15 42 NA 13
16: 3 2001 6 16 44 42 NA
17: 3 2001 7 10 44 42 NA
18: 3 2001 8 18 44 42 16
Upvotes: 2
Reputation: 389047
Here's a way using dplyr
:
library(dplyr)
df %>%
#arrange data by ID, year and month
arrange(ID, year, month) %>%
#group by ID
group_by(ID) %>%
#Get previous value of sum_vol
mutate(lag_year_sum_vol = lag(sum_vol)) %>%
#group by ID and year
group_by(year, .add = TRUE) %>%
#For older dplyr use
#group_by(year, add = TRUE) %>%
#get previous 2 months vol
mutate(lag_2_month_vol = lag(vol, 2),
#Except 1st row in each group replace everything with NA
lag_year_sum_vol = replace(lag_year_sum_vol, -1, NA)) %>%
#Fill with 1st value in group
tidyr::fill(lag_year_sum_vol)
which returns
# ID year month vol sum_vol lag_year_sum_vol lag_2_month_vol
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 2000 1 1 6 NA NA
# 2 1 2000 2 2 6 NA NA
# 3 1 2000 3 3 6 NA 1
# 4 1 2001 3 4 25 6 NA
# 5 1 2001 4 5 25 6 NA
# 6 1 2001 5 16 25 6 4
# 7 2 2000 1 7 24 NA NA
# 8 2 2000 2 8 24 NA NA
# 9 2 2000 3 9 24 NA 7
#10 2 2001 3 12 35 24 NA
#11 2 2001 4 11 35 24 NA
#12 2 2001 5 12 35 24 12
#13 3 2000 1 13 42 NA NA
#14 3 2000 2 14 42 NA NA
#15 3 2000 3 15 42 NA 13
#16 3 2001 6 16 44 42 NA
#17 3 2001 7 10 44 42 NA
#18 3 2001 8 18 44 42 16
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2000L, 2000L,
2001L, 2001L, 2001L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L,
2000L, 2000L, 2000L, 2001L, 2001L, 2001L), month = c(1L, 2L,
3L, 3L, 4L, 5L, 1L, 2L, 3L, 3L, 4L, 5L, 1L, 2L, 3L, 6L, 7L, 8L
), vol = c(1L, 2L, 3L, 4L, 5L, 16L, 7L, 8L, 9L, 12L, 11L, 12L,
13L, 14L, 15L, 16L, 10L, 18L), sum_vol = c(6L, 6L, 6L, 25L, 25L,
25L, 24L, 24L, 24L, 35L, 35L, 35L, 42L, 42L, 42L, 44L, 44L, 44L
)), class = "data.frame", row.names = c(NA, -18L))
Upvotes: 1