Reputation: 681
A little problem I'm faced with. Idealised example here:
I've got a dataframe that records the dates that changes were made to the value
of an insurance policy. This would look something like this:
df <- data.frame(policy_no = c(1, 1, 1, 1, 2, 2, 2),
date_stamp = as.Date(c("2020-01-15", "2020-04-03", "2020-05-31", "2020-07-22", "2020-01-05", "2020-03-17", "2020-06-04" )),
value = c(1000, 2000, 3000, 2000, 500, 1000, 500))
I've got two policies, policy_no
1 and 2. For each of these I'd like to write a bit of code to tell me the status of the value
item as of the 1st of every month. So my output would look like
df_output <- data.frame(policy_no = c(rep(1, 8), rep(2, 8)),
as_at_date = c(seq(as.Date("2020-01-01"), by = "month", length.out = 8), seq(as.Date("2020-01-01"), by = "month", length.out = 8)),
value = c(NA, 1000, 1000, 1000, 2000, 3000, 3000, 2000, NA, 500, 500, 1000, 1000, 1000, 500, 500))
Obviously I could write a loop to do this (for each policy_no
, for each as_at_date
, give me the value
for the most recent snapshot where the date_stamp <= as_at_date
) but that'd be slow & clunky. Can I vectorise this somehow?
tidyverse
based approaches are particularly welcome, but grateful for any suggestions regardless.
Many thanks.
Upvotes: 1
Views: 143
Reputation: 87
Edit: Updated the df sorting by policy and month Please find the solution below:
df <- data.frame(policy_no = c(1, 1, 1, 1, 2, 2, 2,1,1,1, 2, 2, 2,1,1,1),
date_stamp = as.Date(c("2020-01-15", "2020-04-03", "2020-05-31", "2020-07-22", "2020-01-05", "2020-03-17", "2020-06-04", "2020-01-05", "2020-03-17", "2020-06-04","2020-03-01","2020-02-01","2020-01-01","2020-03-01","2020-02-01","2020-01-01" )),
value = c(1000, 2000, 3000, 2000, 500, 1000, 500,500, 1000, 500, 500, 1000, 500,500, 1000, 500))
df$date <- format(as.Date(df$date_stamp,format="%Y-%m-%d"), format = "%d")
df$month <- format(as.Date(df$date_stamp,format="%Y-%m-%d"), format = "%m")
df_temp <- df[df$date == "01",]
df_temp <- df_temp[order(df_temp$policy_no,df_temp$month),1:3]
Upvotes: 0
Reputation: 30494
Here is one approach. First, create a monthly date sequence for each policy. Then, group_by
both the policy and monthly date, and filter only those monthly dates beyond the date stamp. Finally, use slice
to take the last row of data, containing the most recent value. Let me know if this is what you had in mind.
library(tidyverse)
library(lubridate)
df %>%
group_by(policy_no) %>%
mutate(as_at_date = list(seq(floor_date(min(date_stamp), unit = "month"), ceiling_date(max(date_stamp), unit = "month"), by = "month"))) %>%
unnest(cols = as_at_date) %>%
group_by(policy_no, as_at_date) %>%
filter(as_at_date > date_stamp) %>%
slice(tail(row_number(), 1))
If you want to include the NA
for January, you can join with your date sequence, so all months will be included:
df_dates <- df %>%
group_by(policy_no) %>%
mutate(as_at_date = list(seq(floor_date(min(date_stamp), unit = "month"), ceiling_date(max(date_stamp), unit = "month"), by = "month"))) %>%
unnest(cols = as_at_date)
df_dates %>%
group_by(policy_no, as_at_date) %>%
filter(as_at_date > date_stamp) %>%
slice(tail(row_number(), 1)) %>%
right_join(df_dates %>% select(policy_no, as_at_date) %>% distinct()) %>%
arrange(policy_no, as_at_date)
Output
# A tibble: 15 x 4
# Groups: policy_no, as_at_date [15]
policy_no date_stamp value as_at_date
<dbl> <date> <dbl> <date>
1 1 NA NA 2020-01-01
2 1 2020-01-15 1000 2020-02-01
3 1 2020-01-15 1000 2020-03-01
4 1 2020-01-15 1000 2020-04-01
5 1 2020-04-03 2000 2020-05-01
6 1 2020-05-31 3000 2020-06-01
7 1 2020-05-31 3000 2020-07-01
8 1 2020-07-22 2000 2020-08-01
9 2 NA NA 2020-01-01
10 2 2020-01-05 500 2020-02-01
11 2 2020-01-05 500 2020-03-01
12 2 2020-03-17 1000 2020-04-01
13 2 2020-03-17 1000 2020-05-01
14 2 2020-03-17 1000 2020-06-01
15 2 2020-06-04 500 2020-07-01
Upvotes: 1