Ian
Ian

Reputation: 185

two questions about creating lags within groups and with respect to time

I have a data frame x like this:

ID year    month     vol   sum_vol        
1   2000     1        1       6                 
1   2000     2        2       6                  
1   2000     3        3       6                  
1   2001     3        4       25                  
1   2001     4        5       25                  
1   2001     5        16      25                  
2   2000     1        7       24                
2   2000     2        8       24                 
2   2000     3        9       24                
2   2001     3        12      35                 
2   2001     4        11      35                 
2   2001     5        12      35                 
3   2000     1        13      42                 
3   2000     2        14      42                 
3   2000     3        15      42                 
3   2001     6        16      44          
3   2001     7        10      44
3   2001     8        18      44

And the desired output:

ID year    month     vol   sum_vol        lag_year_sum_vol      lag_2_month_vol
1   2000     1        1       6                  NA                    NA
1   2000     2        2       6                  NA                    NA
1   2000     3        3       6                  NA                    1
1   2001     3        4       25                  6                    NA
1   2001     4        5       25                  6                    NA
1   2001     5        16      25                  6                    4
2   2000     1        7       24                 NA                    NA
2   2000     2        8       24                 NA                    NA
2   2000     3        9       24                 NA                    7
2   2001     3        12      35                 24                    NA
2   2001     4        11      35                 24                    NA
2   2001     5        12      35                 24                    12
3   2000     1        13      42                 NA                    NA
3   2000     2        14      42                 NA                    NA
3   2000     3        15      42                 NA                    13
3   2001     6        16      44                 42                    NA
3   2001     7        10      44                 42                    NA
3   2001     8        18      44                 42                    16

I have searched a lot and not getting a conclusion:

So as you can see my questions are:

1) How to create a variable lag_year_sum_vol for one year per ID, which has a value of last year's sum_vol?

2) How to have lagged new variables lag_2_month_vol per customer per year, by 2 months?

NOTE: The ID, year and month in the real data may not be in this sorted order. And actually for any arbitrary number for year, month, and vol -> no pattern in the data.

I prefer a method in either dplyr or data.table. (It seems data.table is more concise.)

Thanks in advance!!

Upvotes: 0

Views: 58

Answers (2)

talat
talat

Reputation: 70286

Here's an option using data.table:

library(data.table)

# column 1
dt[dt[, .(ID, year = year +1, sum_vol)], on = .(ID, year), 
   lag_year_sum_vol := i.sum_vol]

# column 2
dt[dt[, .(ID, year, month = month+2, vol)], on = .(ID, year, month),
   lag_2_month_vol := i.vol]

As you can see, I temporarily modified the data, joined and updated the original data in both cases. There are, of course, other ways to do this using data.table.

The result is:

    ID year month vol sum_vol lag_year_sum_vol lag_2_month_vol
 1:  1 2000     1   1       6               NA              NA
 2:  1 2000     2   2       6               NA              NA
 3:  1 2000     3   3       6               NA               1
 4:  1 2001     3   4      25                6              NA
 5:  1 2001     4   5      25                6              NA
 6:  1 2001     5  16      25                6               4
 7:  2 2000     1   7      24               NA              NA
 8:  2 2000     2   8      24               NA              NA
 9:  2 2000     3   9      24               NA               7
10:  2 2001     3  12      35               24              NA
11:  2 2001     4  11      35               24              NA
12:  2 2001     5  12      35               24              12
13:  3 2000     1  13      42               NA              NA
14:  3 2000     2  14      42               NA              NA
15:  3 2000     3  15      42               NA              13
16:  3 2001     6  16      44               42              NA
17:  3 2001     7  10      44               42              NA
18:  3 2001     8  18      44               42              16

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 389047

Here's a way using dplyr :

library(dplyr)

df %>%
  #arrange data by ID, year and month
  arrange(ID, year, month) %>%
  #group by ID
  group_by(ID) %>%
  #Get previous value of sum_vol
  mutate(lag_year_sum_vol = lag(sum_vol)) %>%
  #group by ID and year
  group_by(year, .add = TRUE) %>%
  #For older dplyr use
  #group_by(year, add = TRUE) %>%
  #get previous 2 months vol
  mutate(lag_2_month_vol = lag(vol, 2), 
  #Except 1st row in each group replace everything with NA
         lag_year_sum_vol = replace(lag_year_sum_vol, -1, NA)) %>%
  #Fill with 1st value in group
  tidyr::fill(lag_year_sum_vol)

which returns

#      ID  year month   vol sum_vol lag_year_sum_vol lag_2_month_vol
#   <int> <int> <int> <int>   <int>            <int>           <int>
# 1     1  2000     1     1       6               NA              NA
# 2     1  2000     2     2       6               NA              NA
# 3     1  2000     3     3       6               NA               1
# 4     1  2001     3     4      25                6              NA
# 5     1  2001     4     5      25                6              NA
# 6     1  2001     5    16      25                6               4
# 7     2  2000     1     7      24               NA              NA
# 8     2  2000     2     8      24               NA              NA
# 9     2  2000     3     9      24               NA               7
#10     2  2001     3    12      35               24              NA
#11     2  2001     4    11      35               24              NA
#12     2  2001     5    12      35               24              12
#13     3  2000     1    13      42               NA              NA
#14     3  2000     2    14      42               NA              NA
#15     3  2000     3    15      42               NA              13
#16     3  2001     6    16      44               42              NA
#17     3  2001     7    10      44               42              NA
#18     3  2001     8    18      44               42              16

data

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2000L, 2000L, 
2001L, 2001L, 2001L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 
2000L, 2000L, 2000L, 2001L, 2001L, 2001L), month = c(1L, 2L, 
3L, 3L, 4L, 5L, 1L, 2L, 3L, 3L, 4L, 5L, 1L, 2L, 3L, 6L, 7L, 8L
), vol = c(1L, 2L, 3L, 4L, 5L, 16L, 7L, 8L, 9L, 12L, 11L, 12L, 
13L, 14L, 15L, 16L, 10L, 18L), sum_vol = c(6L, 6L, 6L, 25L, 25L, 
25L, 24L, 24L, 24L, 35L, 35L, 35L, 42L, 42L, 42L, 44L, 44L, 44L
)), class = "data.frame", row.names = c(NA, -18L))

Upvotes: 1

Related Questions