Reputation: 85
I need to create a new variable "controldates" from a date variable "casedates". The new variable should consist of dates that fall on the same day of the week as the case date, within the same month and year as the case date. For example, if I have a case date on the 3rd Wednesday of July, my control days will be the 1st Wednesday of July, the 2nd Wednesday of July, and the 4th Wednesday of July. Additionally, I would like to create an indicator variable that identifies each group of dates that is created. I would like to do this using dplyr in R.
Starting data:
Casedate
"01-03-2015"
"08-27-2017"
"10-23-2019"
This is how I would like it to look:
Casedate Controldate Index
"01-03-2015" "01-03-2015" 1
"01-03-2015" "01-10-2015" 1
"01-03-2015" "01-17-2015" 1
"01-03-2015" "01-24-2015" 1
"01-03-2015" "01-31-2015" 1
"08-27-2017" "08-06-2017" 2
"08-27-2017" "08-13-2017" 2
"08-27-2017" "08-20-2017" 2
"08-27-2017" "08-27-2017" 2
"10-23-2019" "10-02-2019" 3
"10-23-2019" "10-09-2019" 3
"10-23-2019" "10-16-2019" 3
"10-23-2019" "10-23-2019" 3
"10-23-2019" "10-30-2019" 3
Upvotes: 5
Views: 565
Reputation: 93813
Since a date in the same month can be at most 4 weeks before or 4 weeks after the case date (9 candidate values in total, counting the case date itself), you can calculate that whole range in one go with vectorised sequences. That avoids the need to loop over every value explicitly.
After calculating the candidates, subset to those in the same month as the original value in a single sweep. Using @akrun's df1 example data from below:
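d <- as.Date(df1$Casedate, format="%m-%d-%Y")
r <- rep(d, each=9)                      # each case date repeated 9 times
o <- r + (7 * -4:4)                      # candidates from 4 weeks before to 4 weeks after
i <- rep(seq_along(d), each=9)           # group index, one value per case date
s <- format(o, "%m") == format(r, "%m")  # TRUE where the candidate is in the case date's month
data.frame(
  Casedate = r,
  Controldate = o,
  Index = i
)[s,]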
# Casedate Controldate Index
#5 2015-01-03 2015-01-03 1
#6 2015-01-03 2015-01-10 1
#7 2015-01-03 2015-01-17 1
#8 2015-01-03 2015-01-24 1
#9 2015-01-03 2015-01-31 1
#11 2017-08-27 2017-08-06 2
#12 2017-08-27 2017-08-13 2
#13 2017-08-27 2017-08-20 2
#14 2017-08-27 2017-08-27 2
#20 2019-10-23 2019-10-02 3
#21 2019-10-23 2019-10-09 3
#22 2019-10-23 2019-10-16 3
#23 2019-10-23 2019-10-23 3
#24 2019-10-23 2019-10-30 3
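To make the mechanics concrete, here is a minimal sketch of the same idea for a single case date. The names cand and keep are made up for illustration; the comparison uses "%Y-%m" rather than "%m" only to make the year explicit, which gives the same result here because an offset of at most 28 days cannot land in the same month of a different year.

```r
# One case date: Sunday 27 August 2017
d <- as.Date("08-27-2017", format = "%m-%d-%Y")

# Nine candidates, one week apart, centred on the case date
cand <- d + 7 * (-4:4)

# Keep only the candidates that fall in the case date's own month
keep <- format(cand, "%Y-%m") == format(d, "%Y-%m")
cand[keep]
# [1] "2017-08-06" "2017-08-13" "2017-08-20" "2017-08-27"
```

Adding multiples of 7 days never changes the weekday, so only the month needs to be checked.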
If you want to keep all of the original variables in the dataset, it is a simple fix:
cbind(
  df1[i,],
  data.frame(Controldate = o, Index = i)
)[s,]
E.g.:
# Casedate othvar1 othvar2 Controldate Index
#1.4 01-03-2015 a B 2015-01-03 1
#1.5 01-03-2015 a B 2015-01-10 1
#1.6 01-03-2015 a B 2015-01-17 1
#1.7 01-03-2015 a B 2015-01-24 1
#...
Even on a moderately large dataset (300K rows), there is a large difference in timing between generating sequence runs (about 2 seconds) and looping over each value (over 2 minutes):
Sequence:
df1 <- df1[rep(1:3,each=1e5),,drop=FALSE]
system.time({
  d <- as.Date(df1$Casedate, format="%m-%d-%Y")
  r <- rep(d, each=9)
  o <- r + (7 * -4:4)
  i <- rep(seq_along(d), each=9)
  s <- format(o, "%m") == format(r, "%m")
  data.frame(
    Casedate = r,
    Controldate = o,
    Index = i
  )[s,]
})
# user system elapsed
# 1.909 0.128 2.038
Looping:
library(dplyr)
library(purrr)
library(tidyr)      # provides unnest()
library(lubridate)
system.time({
  df1 %>%
    mutate(Index = row_number(),
           Casedate = mdy(Casedate),
           wd = wday(Casedate, label = TRUE),
           Controldate = map2(floor_date(Casedate, 'month'), wd, ~ {
             x1 <- seq(.x, length.out = 7, by = '1 day')
             # end at the last day of the month so the 1st of the next month is never included
             seq(x1[wday(x1, label = TRUE) == .y],
                 ceiling_date(.x, 'month') - 1, by = '7 day')})) %>%
    unnest(Controldate) %>%
    select(Casedate, Controldate, Index)
})
# user system elapsed
# 131.466 1.143 132.623
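Since the question asks for dplyr, the same vectorised idea can also be written as a pipeline. This is only a sketch, not a benchmarked alternative: it assumes the tidyr package for uncount(), and the helper column k is a name made up for illustration.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(Casedate = c("01-03-2015", "08-27-2017", "10-23-2019"))

res <- df1 %>%
  mutate(Index = row_number(),
         Casedate = as.Date(Casedate, format = "%m-%d-%Y")) %>%
  uncount(9, .id = "k") %>%                        # 9 copies of each row; k runs 1..9 (hypothetical helper)
  mutate(Controldate = Casedate + 7 * (k - 5)) %>% # offsets of -4 to +4 weeks
  filter(format(Controldate, "%Y-%m") == format(Casedate, "%Y-%m")) %>%
  select(Casedate, Controldate, Index)

res
```

Any other columns in df1 are carried along by uncount(), so dropping the final select() keeps them in the result.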
Upvotes: 1
Reputation: 886968
Here is an option with tidyverse. Convert 'Casedate' to Date class with lubridate, then loop over the elements with map2, create a sequence (seq) of dates in a list column, and unnest the list column.
library(dplyr)
library(purrr)
library(tidyr)      # provides unnest()
library(lubridate)
df1 %>%
  mutate(Index = row_number(),
         Casedate = mdy(Casedate),
         wd = wday(Casedate, label = TRUE),
         Controldate = map2(floor_date(Casedate, 'month'), wd, ~ {
           x1 <- seq(.x, length.out = 7, by = '1 day')
           # end at the last day of the month so the 1st of the next month is never included
           seq(x1[wday(x1, label = TRUE) == .y],
               ceiling_date(.x, 'month') - 1, by = '7 day')})) %>%
  unnest(c(Controldate)) %>%
  select(Casedate, Controldate, Index)
-output
# A tibble: 14 x 3
# Casedate Controldate Index
# <date> <date> <int>
# 1 2015-01-03 2015-01-03 1
# 2 2015-01-03 2015-01-10 1
# 3 2015-01-03 2015-01-17 1
# 4 2015-01-03 2015-01-24 1
# 5 2015-01-03 2015-01-31 1
# 6 2017-08-27 2017-08-06 2
# 7 2017-08-27 2017-08-13 2
# 8 2017-08-27 2017-08-20 2
# 9 2017-08-27 2017-08-27 2
#10 2019-10-23 2019-10-02 3
#11 2019-10-23 2019-10-09 3
#12 2019-10-23 2019-10-16 3
#13 2019-10-23 2019-10-23 3
#14 2019-10-23 2019-10-30 3
data
df1 <- structure(list(Casedate = c("01-03-2015", "08-27-2017", "10-23-2019")),
  class = "data.frame", row.names = c(NA, -3L))
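One edge case is worth checking in any approach that walks a weekly sequence up to a month boundary: if the upper bound is the first day of the following month and that day falls on the same weekday, seq() will include it. A small base R sketch, where first and next_month are illustrative names and the date is not from the question's data:

```r
first      <- as.Date("2019-10-04")  # first Friday of October 2019
next_month <- as.Date("2019-11-01")  # first day of November 2019, also a Friday

# Upper bound = first day of the next month: the sequence spills over
with_bound <- seq(first, next_month, by = "7 days")

# Upper bound = last day of the month: the sequence stays inside October
without_bound <- seq(first, next_month - 1, by = "7 days")

format(with_bound, "%m-%d")
# [1] "10-04" "10-11" "10-18" "10-25" "11-01"
format(without_bound, "%m-%d")
# [1] "10-04" "10-11" "10-18" "10-25"
```

Ending the sequence at the last day of the month, or filtering on month afterwards, avoids the extra row.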
Upvotes: 1