novice_coder
novice_coder

Reputation: 181

Summarize the number of days that occur in an isoweek given multiple date ranges from multiple rows

My data are as follows:

group isoweek     begin.date  end.date    days
A     201933      2019-08-04  2019-08-18  7
A     201934      2019-08-18  2019-08-29  7
A     201935      2019-08-18  2019-08-29  4
A     201936      2019-08-29  2019-09-14  7
A     201937      2019-08-29  2019-09-14  6

The days column currently counts the number of days data collection occurred in an isoweek given the begin.date and end.date range. However, I would like the days column to account for the date range in the previous and/or subsequent row. For example, isoweek 35 in the current dataframe indicates that 4 days of data collection occurred, however, isoweek 35 occurs from August 26th-Sept 1st (you can see that using the following code).

library(ISOweek)
> w <- paste("2019-W35", 1:7, sep = "-")
> data.frame(weekdate = w, date = ISOweek2date(w))

Therefore, given the date ranges of:

2019-08-18 to 2019-08-29

and

2019-08-29 to 2019-09-14,

isoweek 35 should indicate that 7 days of data collection occurred. In its current state, it accounts for only one row of date ranges. My desired output would be to summarise the number of days data collection occurred per isoweek, per group (there are multiple groups in my real data).

group isoweek days 
A     201933  7
A     201934  7
A     201935  7
A     201936  7
A     201937  6

Thank you in advance!

Upvotes: 0

Views: 46

Answers (1)

Caspar V.
Caspar V.

Reputation: 1812

I would tackle this in two steps: first create a list that holds, per group, all days that your data collection occurred (I will call them "active days"). Then, for each row in the df, calculate which days were in that respective ISO week, and look up how many of those are present in the list of active days for that group.

Step 0: preparation

Example dataset (I added another group "B"), and required libraries:

library('dplyr')
library('lubridate')
library('purrr')

df <-
  structure(
    list(
      group = c("A", "A", "A", "A", "A", "B"),
      isoweek = c(201933L, 201934L, 201935L, 201936L, 201937L, 201937L),
      begin.date = c("2019-08-04", "2019-08-18", "2019-08-18", "2019-08-29", "2019-08-29", "2019-09-15"),
      end.date = c("2019-08-18", "2019-08-29", "2019-08-29", "2019-09-14", "2019-09-14", "2019-09-16"),
      days = c(7L, 7L, 4L, 7L, 6L, 2L)
    ),
    class = "data.frame",
    row.names = c("1", "2", "3", "4", "5", "6")
  )

Step 1: create list of active days per group

active <- df %>%
  
  # get days from begin.date through end.date per row
  # (these will be stored as numerical instead of dates, but that's okay)
  rowwise() %>%
  mutate(days.in.range = list(ymd(begin.date):ymd(end.date))) %>%
  
  # concatenate per group
  group_by(group) %>%
  summarise(days.in.range = list(unlist(days.in.range)))

# change from data frame to list
active.list <- active$days.in.range
names(active.list) <- active$group
rm(active)

Preview:

> active.list
$A
 [1] 18112 18113 18114 18115 18116 18117 18118 18119 18120 18121 18122 18123 18124 18125 18126 18126 18127 18128 18129 18130 18131 18132 18133 18134
[25] 18135 18136 18137 18126 18127 18128 18129 18130 18131 18132 18133 18134 18135 18136 18137 18137 18138 18139 18140 18141 18142 18143 18144 18145
[49] 18146 18147 18148 18149 18150 18151 18152 18153 18137 18138 18139 18140 18141 18142 18143 18144 18145 18146 18147 18148 18149 18150 18151 18152
[73] 18153

$B
[1] 18154 18155

Step 2: for each row in df, get ISO week days and tally how many match with active.list

result <- df %>%

  # create (temporary) columns for properly formatted ISO week, and the iso week first and last day
  mutate(.isoweek = gsub('(\\d{4})(\\d+)', '\\1-W\\2', isoweek),
         .isoweekfirst = ISOweek2date(paste0(.isoweek,'-',1)),
         .isoweeklast = ISOweek2date(paste0(.isoweek,'-',7))) %>%
  
  # create lists of days in ISO week (`.isodays`)
  rowwise() %>%
  mutate(.isodays = list(.isoweekfirst:.isoweeklast) ) %>%
  ungroup() %>%
  
  # for each row, check how many values in .isodays match with those for the respective group in `active.list`
  mutate(active.days.in.iso.week = unlist(map2(.isodays,
                                               group,
                                               function(.isodays, group) sum(.isodays %in% active.list[[group]])))) #%>%

  # optional: remove intermediate columns
  #select(-.isoweek, -.isoweekfirst, -.isoweeklast, -.isodays)

Preview:

> result
# A tibble: 6 × 10
  group isoweek begin.date end.date    days .isoweek .isoweekfirst .isoweeklast .isodays  active.days.in.iso.week
  <chr>   <int> <chr>      <chr>      <int> <chr>    <date>        <date>       <list>                      <int>
1 A      201933 2019-08-04 2019-08-18     7 2019-W33 2019-08-12    2019-08-18   <int [7]>                       7
2 A      201934 2019-08-18 2019-08-29     7 2019-W34 2019-08-19    2019-08-25   <int [7]>                       7
3 A      201935 2019-08-18 2019-08-29     4 2019-W35 2019-08-26    2019-09-01   <int [7]>                       7
4 A      201936 2019-08-29 2019-09-14     7 2019-W36 2019-09-02    2019-09-08   <int [7]>                       7
5 A      201937 2019-08-29 2019-09-14     6 2019-W37 2019-09-09    2019-09-15   <int [7]>                       6
6 B      201937 2019-09-15 2019-09-16     2 2019-W37 2019-09-09    2019-09-15   <int [7]>                       1

The intended result is in the rightmost column. Intermediate columns and the active.list list can now be removed.

Upvotes: 1

Related Questions