Banjo
Banjo

Reputation: 1251

Calculating lag over a month

I have this data:

library(dplyr)
glimpse(samp)
Observations: 15
Variables: 6
$ date         <date> 2013-01-04, 2013-01-31, 2013-01-09, 2013-01-20, 2013-01-29, 2013...
$ shop_id      <int> 4, 1, 30, 41, 26, 16, 25, 10, 29, 52, 54, 42, 8, 59, 31
$ item_id      <int> 1904, 17880, 14439, 15010, 10917, 10331, 2751, 1475, 16071, 13901...
$ item_cnt_day <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1
$ month        <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3
$ year         <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...

It´s just a sample of a large data set, so there are jumps between the date. In the original data, the time series stars at 2013-01-01 and ends at 2015-11-30. The data are a time series. My goal is to calculate the lag for one month. The problem is that the length of a month is not consistent (i.e. some months have 30 other have 31 days). In order to calculate the lag, I have to set a number. However, as I mentioned before for a month it´s not possible to set a fixed number. Is there a way to calculate the lag month wise?

The target variable is item_cnt_day. The lag should be calculated for the rolling mean. In this example each month has 5 days so the result should like this: library(RcppRoll) library(dplyr)

samp %>%
  mutate(r_mean_5 = lag(roll_meanr(item_cnt_day, 5), 1))
             date shop_id item_id item_cnt_day month year r_mean_5
30717  2013-01-04       4    1904            1     1 2013       NA
43051  2013-01-31       1   17880            1     1 2013       NA
66273  2013-01-09      30   14439            1     1 2013       NA
105068 2013-01-20      41   15010            1     1 2013       NA
23332  2013-01-29      26   10917            1     1 2013       NA
28838  2013-02-22      16   10331            1     2 2013      1.0
40418  2013-02-08      25    2751            2     2 2013      1.0
62219  2013-02-12      10    1475            1     2 2013      1.2
98641  2013-02-16      29   16071            1     2 2013      1.2
21905  2013-02-23      52   13901            2     2 2013      1.2
32219  2013-03-31      54    2972            1     3 2013      1.4
45156  2013-03-17      42   11184            1     3 2013      1.4
69513  2013-03-24       8   19405            1     3 2013      1.2
110206 2013-03-10      59    2255            1     3 2013      1.2
24473  2013-03-07      31   15119            1     3 2013      1.2

Here is the dput().

structure(list(date = structure(c(15709, 15736, 15714, 15725, 
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788, 
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L, 
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L, 
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L, 
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1, 
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L, 
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L, 
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L, 
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")

Upvotes: 0

Views: 1846

Answers (3)

Humpelstielzchen
Humpelstielzchen

Reputation: 6441

Maybe this?

library(lubridate)

df$lag <- df$date %m-% months(1) 

df$rollmean <- sapply(1:nrow(df), function(x) mean(df[df$date <= df$date[x] & df$date >= df$lag[x], "item_cnt_day" ]))

             date shop_id item_id item_cnt_day month year        lag rollmean
30717  2013-01-04       4    1904            1     1 2013 2012-12-04 1.000000
43051  2013-01-31       1   17880            1     1 2013 2012-12-31 1.000000
66273  2013-01-09      30   14439            1     1 2013 2012-12-09 1.000000
105068 2013-01-20      41   15010            1     1 2013 2012-12-20 1.000000
23332  2013-01-29      26   10917            1     1 2013 2012-12-29 1.000000
28838  2013-02-22      16   10331            1     2 2013 2013-01-22 1.166667
40418  2013-02-08      25    2751            2     2 2013 2013-01-08 1.200000
62219  2013-02-12      10    1475            1     2 2013 2013-01-12 1.200000
98641  2013-02-16      29   16071            1     2 2013 2013-01-16 1.166667
21905  2013-02-23      52   13901            2     2 2013 2013-01-23 1.285714
32219  2013-03-31      54    2972            1     3 2013 2013-02-28 1.000000
45156  2013-03-17      42   11184            1     3 2013 2013-02-17 1.200000
69513  2013-03-24       8   19405            1     3 2013 2013-02-24 1.000000
110206 2013-03-10      59    2255            1     3 2013 2013-02-10 1.166667
24473  2013-03-07      31   15119            1     3 2013 2013-02-07 1.333333

%m-% calculates for every date the date one month ago, while accounting for different length of the months (31 days, 30 days, 28 days) and puts it into the column lag. Then in sapply(), the mean of item_cnt_day is calculated for all observations whose date lies within the range of date and lag of the current iteration.

So it doesn't matter how many elements are there for each month or how the elements are ordered.

Upvotes: 2

FMM
FMM

Reputation: 2005

I am not really familiar calculating lag, but maybe that is what you want?

Data:

df <- structure(list(date = structure(c(15709, 15736, 15714, 15725, 
                                  15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788, 
                                  15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L, 
                                                                              26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L, 
                                                                                                                                                 17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L, 
                                                                                                                                                 13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1, 
                                                                                                                                                                                                                 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L, 
                                                                                                                                                                                                                                                                      1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L, 
                                                                                                                                                                                                                                                                                                                                    2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 
                                                                                                                                                                                                                                                                                                                                    2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L, 
                                                                                                                                                                                                                                                                                                                                                                                       66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L, 
                                                                                                                                                                                                                                                                                                                                                                                       32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")

Calculation:

df %>% 
  dplyr::mutate(days_in_month = lubridate::days_in_month(date)) %>% 
  tidyr::nest(-c(month, days_in_month)) %>% 
  dplyr::mutate(lag = purrr::map2(data, days_in_month, ~ stats::lag(.x$item_cnt_day, .y)))

EDIT based on comment:

maybe this then?

df %>% 
  tidyr::nest(-month) %>% 
  dplyr::mutate(
    ndays = purrr::map_int(data, nrow),
    lag = purrr::map2_dbl(data, ndays, ~ zoo::rollmean(.x$item_cnt_day, .y))
    )

Upvotes: 1

georg.dev
georg.dev

Reputation: 1263

The date class supports seq for different time intervals (documentation).

So you can basically do:

calculate_lag <- function(date) {
  return(seq(date, by = "1 month", length.out = 2)[2])
}


date_column <- as.Date(sapply( _YOUR_DATAFRAME_ , calculate_lag), origin="1970-01-01")

Upvotes: 2

Related Questions