Alan
Alan

Reputation: 681

Constructing Monthly Snapshots of Time Stamped Data in R

A little problem I'm faced with. Idealised example here:

I've got a dataframe that records the dates that changes were made to the value of an insurance policy. This would look something like this:

df <- data.frame(policy_no  = c(1, 1, 1, 1, 2, 2, 2),
                 date_stamp = as.Date(c("2020-01-15", "2020-04-03", "2020-05-31", "2020-07-22", "2020-01-05", "2020-03-17", "2020-06-04" )),
                 value      = c(1000, 2000, 3000, 2000, 500, 1000, 500))

I've got two policies, policy_no 1 and 2. For each of these I'd like to write a bit of code to tell me the status of the value item as of the 1st of every month. So my output would look like

df_output <- data.frame(policy_no =  c(rep(1, 8), rep(2, 8)),
                        as_at_date = c(seq(as.Date("2020-01-01"), by = "month", length.out = 8), seq(as.Date("2020-01-01"), by = "month", length.out = 8)),
                        value = c(NA, 1000, 1000, 1000, 2000, 3000, 3000, 2000, NA, 500, 500, 1000, 1000, 1000, 500, 500))

enter image description here

Obviously I could write a loop to do this (for each policy_no, for each as_at_date, give me the value for the most recent snapshot where the date_stamp <= as_at_date) but that'd be slow & clunky. Can I vectorise this somehow?

tidyverse based approaches are particularly welcome, but grateful for any suggestions regardless.

Many thanks.

Upvotes: 1

Views: 143

Answers (2)

Ganesh Shastry
Ganesh Shastry

Reputation: 87

Edit: Updated the df sorting by policy and month Please find the solution below:

df <- data.frame(policy_no  = c(1, 1, 1, 1, 2, 2, 2,1,1,1, 2, 2, 2,1,1,1),
                 date_stamp = as.Date(c("2020-01-15", "2020-04-03", "2020-05-31", "2020-07-22", "2020-01-05", "2020-03-17", "2020-06-04", "2020-01-05", "2020-03-17", "2020-06-04","2020-03-01","2020-02-01","2020-01-01","2020-03-01","2020-02-01","2020-01-01" )),
                 value      = c(1000, 2000, 3000, 2000, 500, 1000, 500,500, 1000, 500, 500, 1000, 500,500, 1000, 500))
df$date <- format(as.Date(df$date_stamp,format="%Y-%m-%d"), format = "%d")
df$month <- format(as.Date(df$date_stamp,format="%Y-%m-%d"), format = "%m")

df_temp <- df[df$date == "01",]
df_temp <- df_temp[order(df_temp$policy_no,df_temp$month),1:3]

Upvotes: 0

Ben
Ben

Reputation: 30494

Here is one approach. First, create a monthly date sequence for each policy. Then, group_by both the policy and monthly date, and filter only those monthly dates beyond the date stamp. Finally, use slice to take the last row of data, containing the most recent value. Let me know if this is what you had in mind.

library(tidyverse)
library(lubridate)

df %>%
  group_by(policy_no) %>%
  mutate(as_at_date = list(seq(floor_date(min(date_stamp), unit = "month"), ceiling_date(max(date_stamp), unit = "month"), by = "month"))) %>%
  unnest(cols = as_at_date) %>%
  group_by(policy_no, as_at_date) %>%
  filter(as_at_date > date_stamp) %>%
  slice(tail(row_number(), 1))

If you want to include the NA for January, you can join with your date sequence, so all months will be included:

df_dates <- df %>%
  group_by(policy_no) %>%
  mutate(as_at_date = list(seq(floor_date(min(date_stamp), unit = "month"), ceiling_date(max(date_stamp), unit = "month"), by = "month"))) %>%
  unnest(cols = as_at_date) 

df_dates %>%
  group_by(policy_no, as_at_date) %>%
  filter(as_at_date > date_stamp) %>%
  slice(tail(row_number(), 1)) %>%
  right_join(df_dates %>% select(policy_no, as_at_date) %>% distinct()) %>%
  arrange(policy_no, as_at_date) 

Output

# A tibble: 15 x 4
# Groups:   policy_no, as_at_date [15]
   policy_no date_stamp value as_at_date
       <dbl> <date>     <dbl> <date>    
 1         1 NA            NA 2020-01-01
 2         1 2020-01-15  1000 2020-02-01
 3         1 2020-01-15  1000 2020-03-01
 4         1 2020-01-15  1000 2020-04-01
 5         1 2020-04-03  2000 2020-05-01
 6         1 2020-05-31  3000 2020-06-01
 7         1 2020-05-31  3000 2020-07-01
 8         1 2020-07-22  2000 2020-08-01
 9         2 NA            NA 2020-01-01
10         2 2020-01-05   500 2020-02-01
11         2 2020-01-05   500 2020-03-01
12         2 2020-03-17  1000 2020-04-01
13         2 2020-03-17  1000 2020-05-01
14         2 2020-03-17  1000 2020-06-01
15         2 2020-06-04   500 2020-07-01

Upvotes: 1

Related Questions