Reputation: 1251
I have this data:
library(dplyr)
glimpse(samp)
Observations: 15
Variables: 6
$ date <date> 2013-01-04, 2013-01-31, 2013-01-09, 2013-01-20, 2013-01-29, 2013...
$ shop_id <int> 4, 1, 30, 41, 26, 16, 25, 10, 29, 52, 54, 42, 8, 59, 31
$ item_id <int> 1904, 17880, 14439, 15010, 10917, 10331, 2751, 1475, 16071, 13901...
$ item_cnt_day <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1
$ month <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
It´s just a sample of a large data set, so there are jumps between the date
.
In the original data, the time series stars at 2013-01-01 and ends at 2015-11-30. The data are a time series. My goal is to calculate the lag
for one month. The problem is that the length of a month is not consistent (i.e. some months have 30 other have 31 days). In order to calculate the lag, I have to set a number. However, as I mentioned before for a month it´s not possible to set a fixed number. Is there a way to calculate the lag month wise?
The target variable is item_cnt_day
. The lag should be calculated for the rolling mean. In this example each month has 5 days so the result should like this:
library(RcppRoll)
library(dplyr)
samp %>%
mutate(r_mean_5 = lag(roll_meanr(item_cnt_day, 5), 1))
date shop_id item_id item_cnt_day month year r_mean_5
30717 2013-01-04 4 1904 1 1 2013 NA
43051 2013-01-31 1 17880 1 1 2013 NA
66273 2013-01-09 30 14439 1 1 2013 NA
105068 2013-01-20 41 15010 1 1 2013 NA
23332 2013-01-29 26 10917 1 1 2013 NA
28838 2013-02-22 16 10331 1 2 2013 1.0
40418 2013-02-08 25 2751 2 2 2013 1.0
62219 2013-02-12 10 1475 1 2 2013 1.2
98641 2013-02-16 29 16071 1 2 2013 1.2
21905 2013-02-23 52 13901 2 2 2013 1.2
32219 2013-03-31 54 2972 1 3 2013 1.4
45156 2013-03-17 42 11184 1 3 2013 1.4
69513 2013-03-24 8 19405 1 3 2013 1.2
110206 2013-03-10 59 2255 1 3 2013 1.2
24473 2013-03-07 31 15119 1 3 2013 1.2
Here is the dput()
.
structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")
Upvotes: 0
Views: 1846
Reputation: 6441
Maybe this?
library(lubridate)
df$lag <- df$date %m-% months(1)
df$rollmean <- sapply(1:nrow(df), function(x) mean(df[df$date <= df$date[x] & df$date >= df$lag[x], "item_cnt_day" ]))
date shop_id item_id item_cnt_day month year lag rollmean
30717 2013-01-04 4 1904 1 1 2013 2012-12-04 1.000000
43051 2013-01-31 1 17880 1 1 2013 2012-12-31 1.000000
66273 2013-01-09 30 14439 1 1 2013 2012-12-09 1.000000
105068 2013-01-20 41 15010 1 1 2013 2012-12-20 1.000000
23332 2013-01-29 26 10917 1 1 2013 2012-12-29 1.000000
28838 2013-02-22 16 10331 1 2 2013 2013-01-22 1.166667
40418 2013-02-08 25 2751 2 2 2013 2013-01-08 1.200000
62219 2013-02-12 10 1475 1 2 2013 2013-01-12 1.200000
98641 2013-02-16 29 16071 1 2 2013 2013-01-16 1.166667
21905 2013-02-23 52 13901 2 2 2013 2013-01-23 1.285714
32219 2013-03-31 54 2972 1 3 2013 2013-02-28 1.000000
45156 2013-03-17 42 11184 1 3 2013 2013-02-17 1.200000
69513 2013-03-24 8 19405 1 3 2013 2013-02-24 1.000000
110206 2013-03-10 59 2255 1 3 2013 2013-02-10 1.166667
24473 2013-03-07 31 15119 1 3 2013 2013-02-07 1.333333
%m-%
calculates for every date the date one month ago, while accounting for different length of the months (31 days, 30 days, 28 days) and puts it into the column lag
. Then in sapply()
, the mean of item_cnt_day
is calculated for all observations whose date lies within the range of date
and lag
of the current iteration.
So it doesn't matter how many elements are there for each month or how the elements are ordered.
Upvotes: 2
Reputation: 2005
I am not really familiar calculating lag
, but maybe that is what you want?
Data:
df <- structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")
Calculation:
df %>%
dplyr::mutate(days_in_month = lubridate::days_in_month(date)) %>%
tidyr::nest(-c(month, days_in_month)) %>%
dplyr::mutate(lag = purrr::map2(data, days_in_month, ~ stats::lag(.x$item_cnt_day, .y)))
EDIT based on comment:
maybe this then?
df %>%
tidyr::nest(-month) %>%
dplyr::mutate(
ndays = purrr::map_int(data, nrow),
lag = purrr::map2_dbl(data, ndays, ~ zoo::rollmean(.x$item_cnt_day, .y))
)
Upvotes: 1
Reputation: 1263
The date
class supports seq
for different time intervals (documentation).
So you can basically do:
calculate_lag <- function(date) {
return(seq(date, by = "1 month", length.out = 2)[2])
}
date_column <- as.Date(sapply( _YOUR_DATAFRAME_ , calculate_lag), origin="1970-01-01")
Upvotes: 2