Reputation: 181
My data are as follows:
group isoweek begin.date end.date days
A 201933 2019-08-04 2019-08-18 7
A 201934 2019-08-18 2019-08-29 7
A 201935 2019-08-18 2019-08-29 4
A 201936 2019-08-29 2019-09-14 7
A 201937 2019-08-29 2019-09-14 6
The days column currently counts the number of days data collection occurred in an isoweek given the begin.date and end.date range. However, I would like the days column to account for the date range in the previous and/or subsequent row. For example, isoweek 35 in the current dataframe indicates that 4 days of data collection occurred, however, isoweek 35 occurs from August 26th-Sept 1st (you can see that using the following code).
library(ISOweek)
> w <- paste("2019-W35", 1:7, sep = "-")
> data.frame(weekdate = w, date = ISOweek2date(w))
Therefore, given the date ranges of:
2019-08-18 to 2019-08-29
and
2019-08-29 to 2019-09-14,
isoweek 35 should indicate that 7 days of data collection occurred. In its current state, it accounts for only one row of date ranges. My desired output would be to summarise the number of days data collection occurred per isoweek, per group (there are multiple groups in my real data).
group isoweek days
A 201933 7
A 201934 7
A 201935 7
A 201936 7
A 201937 6
Thank you in advance!
Upvotes: 0
Views: 46
Reputation: 1812
I would tackle this in two steps: first create a list that holds, per group, all days that your data collection occurred (I will call them "active days"). Then, for each row in the df, calculate which days were in that respective ISO week, and look up how many of those are present in the list of active days for that group.
Example dataset (I added another group "B"), and required libraries:
library('dplyr')
library('lubridate')
library('purrr')
df <-
structure(
list(
group = c("A", "A", "A", "A", "A", "B"),
isoweek = c(201933L, 201934L, 201935L, 201936L, 201937L, 201937L),
begin.date = c("2019-08-04", "2019-08-18", "2019-08-18", "2019-08-29", "2019-08-29", "2019-09-15"),
end.date = c("2019-08-18", "2019-08-29", "2019-08-29", "2019-09-14", "2019-09-14", "2019-09-16"),
days = c(7L, 7L, 4L, 7L, 6L, 2L)
),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6")
)
active <- df %>%
# get days from begin.date through end.date per row
# (these will be stored as numerical instead of dates, but that's okay)
rowwise() %>%
mutate(days.in.range = list(ymd(begin.date):ymd(end.date))) %>%
# concatenate per group
group_by(group) %>%
summarise(days.in.range = list(unlist(days.in.range)))
# change from data frame to list
active.list <- active$days.in.range
names(active.list) <- active$group
rm(active)
Preview:
> active.list
$A
[1] 18112 18113 18114 18115 18116 18117 18118 18119 18120 18121 18122 18123 18124 18125 18126 18126 18127 18128 18129 18130 18131 18132 18133 18134
[25] 18135 18136 18137 18126 18127 18128 18129 18130 18131 18132 18133 18134 18135 18136 18137 18137 18138 18139 18140 18141 18142 18143 18144 18145
[49] 18146 18147 18148 18149 18150 18151 18152 18153 18137 18138 18139 18140 18141 18142 18143 18144 18145 18146 18147 18148 18149 18150 18151 18152
[73] 18153
$B
[1] 18154 18155
df
, get ISO week days and tally how many match with active.list
result <- df %>%
# create (temporary) columns for properly formatted ISO week, and the iso week first and last day
mutate(.isoweek = gsub('(\\d{4})(\\d+)', '\\1-W\\2', isoweek),
.isoweekfirst = ISOweek2date(paste0(.isoweek,'-',1)),
.isoweeklast = ISOweek2date(paste0(.isoweek,'-',7))) %>%
# create lists of days in ISO week (`.isodays`)
rowwise() %>%
mutate(.isodays = list(.isoweekfirst:.isoweeklast) ) %>%
ungroup() %>%
# for each row, check how many values in .isodays match with those for the respective group in `active.list`
mutate(active.days.in.iso.week = unlist(map2(.isodays,
group,
function(.isodays, group) sum(.isodays %in% active.list[[group]])))) #%>%
# optional: remove intermediate columns
#select(-.isoweek, -.isoweekfirst, -.isoweeklast, -.isodays)
Preview:
> result
# A tibble: 6 × 10
group isoweek begin.date end.date days .isoweek .isoweekfirst .isoweeklast .isodays active.days.in.iso.week
<chr> <int> <chr> <chr> <int> <chr> <date> <date> <list> <int>
1 A 201933 2019-08-04 2019-08-18 7 2019-W33 2019-08-12 2019-08-18 <int [7]> 7
2 A 201934 2019-08-18 2019-08-29 7 2019-W34 2019-08-19 2019-08-25 <int [7]> 7
3 A 201935 2019-08-18 2019-08-29 4 2019-W35 2019-08-26 2019-09-01 <int [7]> 7
4 A 201936 2019-08-29 2019-09-14 7 2019-W36 2019-09-02 2019-09-08 <int [7]> 7
5 A 201937 2019-08-29 2019-09-14 6 2019-W37 2019-09-09 2019-09-15 <int [7]> 6
6 B 201937 2019-09-15 2019-09-16 2 2019-W37 2019-09-09 2019-09-15 <int [7]> 1
The intended result is in the rightmost column. Intermediate columns and the active.list
list can now be removed.
Upvotes: 1