Sara

Reputation: 85

Creating a new date variable on the same day of the week, within the same month and year, as an original date variable in R

I need to create a new variable "controldates" from a date variable "casedates". The new variable should consist of the dates that fall on the same day of the week as the case date, within the same month and year as the case date. For example, if I have a case date on the 3rd Wednesday of July, my control days will be the 1st Wednesday of July, the 2nd Wednesday of July, and the 4th Wednesday of July. Additionally, I would like to create an indicator variable identifying each group of dates that is created. I would like to do this using dplyr in R.

Starting data:

Casedate
 "01-03-2015"
 "08-27-2017"
 "10-23-2019"

This is how I would like it to look

Casedate          Controldate      Index
"01-03-2015"      "01-03-2015"       1
"01-03-2015"      "01-10-2015"       1
"01-03-2015"      "01-17-2015"       1
"01-03-2015"      "01-24-2015"       1
"01-03-2015"      "01-31-2015"       1
"08-27-2017"      "08-06-2017"       2
"08-27-2017"      "08-13-2017"       2
"08-27-2017"      "08-20-2017"       2
"08-27-2017"      "08-27-2017"       2
"10-23-2019"      "10-02-2019"       3
"10-23-2019"      "10-09-2019"       3
"10-23-2019"      "10-16-2019"       3
"10-23-2019"      "10-23-2019"       3
"10-23-2019"      "10-30-2019"       3

Upvotes: 5

Views: 565

Answers (2)

thelatemail

Reputation: 93813

Since a date can have at most 4 same-weekday dates before it or 4 after it within the same month, the range from 4 weeks earlier to 4 weeks later (9 candidate values in total) always covers every match. You can calculate that range for all rows in one go with vectorized sequences, which avoids looping over every value explicitly.

After calculating the candidates, subset to those in the same month as the original value in a single sweep. Using @akrun's df1 example data from below:

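# Parse the case dates (stored as "mm-dd-YYYY" strings)
d  <- as.Date(df1$Casedate, format="%m-%d-%Y")
# Repeat each case date 9 times, one slot per candidate week
r  <- rep(d, each=9)
# Candidates from 4 weeks before to 4 weeks after (the offsets recycle over r)
o  <- r + (7 * -4:4)
# Group index: the row number of the originating case date
i  <- rep(seq_along(d), each=9)
# Keep only the candidates that fall in the same month as the case date
s  <- format(o, "%m") == format(r, "%m")

data.frame(
    Casedate = r,
    Controldate = o,
    Index = i
)[s,]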

#     Casedate Controldate Index
#5  2015-01-03  2015-01-03     1
#6  2015-01-03  2015-01-10     1
#7  2015-01-03  2015-01-17     1
#8  2015-01-03  2015-01-24     1
#9  2015-01-03  2015-01-31     1
#11 2017-08-27  2017-08-06     2
#12 2017-08-27  2017-08-13     2
#13 2017-08-27  2017-08-20     2
#14 2017-08-27  2017-08-27     2
#20 2019-10-23  2019-10-02     3
#21 2019-10-23  2019-10-09     3
#22 2019-10-23  2019-10-16     3
#23 2019-10-23  2019-10-23     3
#24 2019-10-23  2019-10-30     3
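
A quick sanity check can confirm that every control date really shares the weekday, month, and year of its case date. This is only a sketch: it rebuilds the same vectors as above and stores the result under the hypothetical name `res`. The `%u` format code gives the weekday as a number (1 = Monday), which avoids locale-dependent day names.

```r
df1 <- data.frame(Casedate = c("01-03-2015", "08-27-2017", "10-23-2019"))

d <- as.Date(df1$Casedate, format = "%m-%d-%Y")
r <- rep(d, each = 9)
o <- r + (7 * -4:4)
i <- rep(seq_along(d), each = 9)
s <- format(o, "%m") == format(r, "%m")
res <- data.frame(Casedate = r, Controldate = o, Index = i)[s, ]  # `res` is an assumed name

# Same weekday and same year-month for every case/control pair
same_weekday <- format(res$Casedate, "%u") == format(res$Controldate, "%u")
same_month   <- format(res$Casedate, "%Y-%m") == format(res$Controldate, "%Y-%m")
stopifnot(all(same_weekday), all(same_month))

# Number of control dates generated per case date
table(res$Index)
```

Comparing on `"%Y-%m"` here is slightly stricter than the month-only filter, but because the candidates never stray more than 28 days from the case date, the two tests select the same rows.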

If you want to keep all of the original variables in the dataset, it is a simple fix:

cbind(
  df1[i, , drop = FALSE],   # drop = FALSE keeps a data.frame even if df1 has one column
  data.frame(Controldate = o, Index = i)
)[s,]

For example, with two extra columns (othvar1 and othvar2) in df1:

#      Casedate othvar1 othvar2 Controldate Index
#1.4 01-03-2015       a       B  2015-01-03     1
#1.5 01-03-2015       a       B  2015-01-10     1
#1.6 01-03-2015       a       B  2015-01-17     1
#1.7 01-03-2015       a       B  2015-01-24     1
#...
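
Since the question asks for dplyr, the same sequence idea could plausibly be written as a pipeline. The following is a sketch under assumptions, not a tested drop-in: the names `offsets` and `controls` are made up for illustration, and it relies only on `mutate()`, `slice()`, `filter()`, `select()`, `row_number()` and `n()` from dplyr.

```r
library(dplyr)

df1 <- data.frame(Casedate = c("01-03-2015", "08-27-2017", "10-23-2019"))

offsets <- 7 * -4:4   # 4 weeks before through 4 weeks after, in days

controls <- df1 %>%
  mutate(Casedate = as.Date(Casedate, format = "%m-%d-%Y"),
         Index = row_number()) %>%
  # repeat every row once per candidate offset
  slice(rep(seq_len(n()), each = length(offsets))) %>%
  # add the offsets; n() is now the expanded row count
  mutate(Controldate = Casedate + rep(offsets, times = n() / length(offsets))) %>%
  # keep only candidates in the same month (and year) as the case date
  filter(format(Controldate, "%Y-%m") == format(Casedate, "%Y-%m")) %>%
  select(Casedate, Controldate, Index)

controls
```

Any other columns in `df1` are carried along by `slice()`, so dropping the final `select()` keeps them in the result.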

Even on a moderately large dataset (300K rows), the difference in timing is substantial: generating the sequences takes about 2 seconds, whereas looping over each value takes over 2 minutes:

Sequence:

df1 <- df1[rep(1:3,each=1e5),,drop=FALSE]

system.time({
d  <- as.Date(df1$Casedate, format="%m-%d-%Y")
r  <- rep(d, each=9)
o  <- r + (7 * -4:4)
i  <- rep(seq_along(d), each=9)
s  <- format(o, "%m") == format(r, "%m")

data.frame(
    Casedate = r,
    Controldate = o,
    Index = i
)[s,]
})

#   user  system elapsed 
#  1.909   0.128   2.038 

Looping:

library(dplyr)
library(purrr)
library(tidyr)      # provides unnest()
library(lubridate)

system.time({
df1 %>% 
  mutate(Index = row_number(), 
         Casedate = mdy(Casedate), 
         wd = wday(Casedate, label = TRUE), 
         Controldate = map2(floor_date(Casedate, 'month'), wd, ~ {
           x1 <- seq(.x, length.out = 7, by = '1 day')
           seq(x1[wday(x1, label = TRUE) == .y],
               ceiling_date(.x, 'month'), by = '7 day')})) %>% 
  unnest(Controldate) %>%
  select(Casedate, Controldate, Index)
})

#    user  system elapsed 
# 131.466   1.143 132.623

Upvotes: 1

akrun

Reputation: 886968

Here is an option with the tidyverse. Convert 'Casedate' to Date class with lubridate, loop over the elements with map2 to create a sequence of dates for each one as a list column, then unnest the list column.

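library(dplyr)
library(purrr)
library(tidyr)      # provides unnest()
library(lubridate)

df1 %>% 
  mutate(Index = row_number(), 
         Casedate = mdy(Casedate), 
         wd = wday(Casedate, label = TRUE), 
         Controldate = map2(floor_date(Casedate, 'month'), wd, ~ {
           # first 7 days of the month: locate the first date on the case weekday
           x1 <- seq(.x, length.out = 7, by = '1 day')
           # step weekly up to the last day of the month; ceiling_date() of a
           # Date on the 1st gives the 1st of the next month, so subtract 1 to
           # keep March 1 out of a 28-day February
           seq(x1[wday(x1, label = TRUE) == .y],
               ceiling_date(.x, 'month') - 1, by = '7 day')})) %>% 
  unnest(c(Controldate)) %>%
  select(Casedate, Controldate, Index)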

Output:

# A tibble: 14 x 3
#   Casedate   Controldate Index
#   <date>     <date>      <int>
# 1 2015-01-03 2015-01-03      1
# 2 2015-01-03 2015-01-10      1
# 3 2015-01-03 2015-01-17      1
# 4 2015-01-03 2015-01-24      1
# 5 2015-01-03 2015-01-31      1
# 6 2017-08-27 2017-08-06      2
# 7 2017-08-27 2017-08-13      2
# 8 2017-08-27 2017-08-20      2
# 9 2017-08-27 2017-08-27      2
#10 2019-10-23 2019-10-02      3
#11 2019-10-23 2019-10-09      3
#12 2019-10-23 2019-10-16      3
#13 2019-10-23 2019-10-23      3
#14 2019-10-23 2019-10-30      3
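
The per-date step (find the first date in the month on the case weekday, then step by 7 days) can also be expressed with plain date arithmetic. Below is a minimal base R sketch of that step; `controls_for` is a hypothetical helper name, and it handles a single Date at a time, so it would still need to be applied per row.

```r
# Hypothetical helper: all dates in the month of `case` on the same weekday
controls_for <- function(case) {
  # first day of the case date's month
  first <- as.Date(format(case, "%Y-%m-01"))
  # days from the 1st to the first occurrence of the case weekday (0 to 6);
  # %u is the weekday number, 1 = Monday through 7 = Sunday
  shift <- (as.integer(format(case, "%u")) - as.integer(format(first, "%u"))) %% 7
  # a month holds at most 5 occurrences of any weekday
  cand <- first + shift + 7 * 0:4
  # drop a 5th candidate that spills into the next month
  cand[format(cand, "%m") == format(case, "%m")]
}

controls_for(as.Date("2015-01-03"))
controls_for(as.Date("2015-02-08"))
```

February 2015 is a handy second test case because it has exactly four weeks, so the fifth candidate lands on March 1 and has to be filtered out.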

Data:

df1 <- structure(list(Casedate = c("01-03-2015", "08-27-2017", "10-23-2019"
)), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 1
