Reputation: 25

Use dplyr to aggregate counts by month from start-stop ranged variables?

Let's say I have school enrollment data stored in this format, with start date and end date fields:

unique_name	enrollment_start	enrollment_end
Amy	1, Jan, 2017	30, Sep, 2018
Franklin	1, Jan, 2017	19, Feb, 2017
Franklin	5, Jun, 2017	4, Feb, 2018
Franklin	21, Oct, 2018	9, Mar, 2019
Samir	1, Jun, 2017	4, Feb, 2017
Samir	5, Apr, 2017	12, Sep, 2018
...	...	...

And I want to produce aggregated counts of enrollment by month like this:

month	enrollment_count
Jan, 2017	25
Feb, 2017	31
Mar, 2017	19
Apr, 2017	34
May, 2017	29
Jun, 2017	32
...	...

Is there an easy way to accomplish this with dplyr?

The only way I can think to do this is by looping over a list of all months from range month_min to month_max to count the number of rows with start or stop dates that fall inside each month. Hoping for easier code.

Upvotes: 2

Answers (3)

Davis Vaughan

Reputation: 2960

I think this can be done pretty elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. And ivs is a package dedicated to working with intervals of data, which is exactly what you have here. Here we assume that if your enrollment start/end fell in a month, then you should be considered active in that month.

library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)

enrollments <- tribble(
  ~unique_name, ~enrollment_start, ~enrollment_end,
  "Amy",        "1, Jan, 2017",    "30, Sep, 2018",
  "Franklin",   "1, Jan, 2017",    "19, Feb, 2017",
  "Franklin",   "5, Jun, 2017",    "4, Feb, 2018",
  "Franklin",   "21, Oct, 2018",   "9, Mar, 2019",
  "Samir",      "1, Jan, 2017",    "4, Feb, 2017",
  "Samir",      "5, Apr, 2017",    "12, Sep, 2018"
)

# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
  mutate(
    start = enrollment_start %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month"),
    end = enrollment_end %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month") %>%
      add_months(1),
    .keep = "unused"
  )

enrollments
#> # A tibble: 6 × 3
#>   unique_name start        end         
#>   <chr>       <ymd<month>> <ymd<month>>
#> 1 Amy         2017-01      2018-10     
#> 2 Franklin    2017-01      2017-03     
#> 3 Franklin    2017-06      2018-03     
#> 4 Franklin    2018-10      2019-04     
#> 5 Samir       2017-01      2017-03     
#> 6 Samir       2017-04      2018-10

# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
  mutate(active = iv(start, end), .keep = "unused")

enrollments
#> # A tibble: 6 × 2
#>   unique_name             active
#>   <chr>         <iv<ymd<month>>>
#> 1 Amy         [2017-01, 2018-10)
#> 2 Franklin    [2017-01, 2017-03)
#> 3 Franklin    [2017-06, 2018-03)
#> 4 Franklin    [2018-10, 2019-04)
#> 5 Samir       [2017-01, 2017-03)
#> 6 Samir       [2017-04, 2018-10)

# We'll generate a sequence of months that will be part of the final result
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L

months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#>    month       
#>    <ymd<month>>
#>  1 2017-01     
#>  2 2017-02     
#>  3 2017-03     
#>  4 2017-04     
#>  5 2017-05     
#>  6 2017-06     
#>  7 2017-07     
#>  8 2017-08     
#>  9 2017-09     
#> 10 2017-10     
#> # … with 17 more rows

# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
  mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#>    month        count
#>    <ymd<month>> <int>
#>  1 2017-01          3
#>  2 2017-02          3
#>  3 2017-03          1
#>  4 2017-04          2
#>  5 2017-05          2
#>  6 2017-06          3
#>  7 2017-07          3
#>  8 2017-08          3
#>  9 2017-09          3
#> 10 2017-10          3
#> # … with 17 more rows

Created on 2022-04-05 by the reprex package (v2.0.1)
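As a rough cross-check of the half-open interval logic, the same counts can be reproduced in base R without clock or ivs. This is only a sketch: `month_index()` is a hypothetical helper (not part of any package), and the dates are the six example rows with Samir's first start taken as 1 Jan 2017, as in the tribble above.

```r
# Example rows, already parsed to Date (Samir's first start assumed to be 1 Jan 2017)
starts <- as.Date(c("2017-01-01", "2017-01-01", "2017-06-05",
                    "2018-10-21", "2017-01-01", "2017-04-05"))
ends   <- as.Date(c("2018-09-30", "2017-02-19", "2018-02-04",
                    "2019-03-09", "2017-02-04", "2018-09-12"))

# Hypothetical helper: number of months since year 0, so months can be compared as integers
month_index <- function(d) {
  as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m")) - 1L
}

lo <- month_index(starts)       # inclusive lower bound
hi <- month_index(ends) + 1L    # exclusive upper bound, mirroring add_months(1)

# Every month from the earliest start to the latest end
grid <- seq(min(lo), max(hi) - 1L)

# A month m is active for an enrollment when lo <= m < hi
counts <- vapply(grid, function(m) sum(lo <= m & m < hi), integer(1))

result <- data.frame(
  month = sprintf("%04d-%02d", grid %/% 12L, grid %% 12L + 1L),
  count = counts
)
head(result)
```

The `+ 1L` on the upper bound plays the same role as `add_months(1)` in the answer: it makes the end month itself count as active while keeping the comparison half-open.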

Upvotes: 2

Brayn

Reputation: 416

Here's my take on this with dplyr and tidyr.

Pivot the data to create multiple rows per student, and parse the dates.
Group by student and generate the missing months using complete.
Group by the generated periods and count.

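library(dplyr)
library(tidyr)

# Build one row per student per month between the first start and last end.
# Note: grouping by unique_name alone also fills any gap between a
# student's separate enrollments.
expanded <- data %>%
  pivot_longer(cols = c('enrollment_start', 'enrollment_end')) %>%
  mutate(value = as.Date(value, format = "%d, %B, %Y")) %>%
  mutate(value = lubridate::floor_date(value, 'month')) %>%

#   unique_name name             value     
#   <chr>       <chr>            <date>    
# 1 Amy         enrollment_start 2017-01-01
# 2 Amy         enrollment_end   2018-09-01
# 3 Franklin    enrollment_start 2017-01-01
# 4 Franklin    enrollment_end   2017-02-01
#   ..etc.

  group_by(unique_name) %>%
  complete(value = seq.Date(min(value), max(value), by = "month")) %>%
  arrange(unique_name, value)

# Count rows per month on the expanded data, not on the original `data`
enrollment_count <- expanded %>%
  group_by(value) %>%
  count()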

Edit: I forgot to floor the dates in order to properly aggregate per period at the end. Added floor_date from lubridate to do this.
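One thing to watch with this approach: a student with several separate enrollments (like Franklin) gets the months between them filled in as well when grouping by name only. A possible variation, sketched here under the assumption that each input row is one enrollment, is to add a per-row id (the `spell` column below is a hypothetical name) and complete within each enrollment instead.

```r
library(dplyr)
library(tidyr)

# Two of Franklin's enrollments, with a gap from March to May 2017
data <- tribble(
  ~unique_name, ~enrollment_start, ~enrollment_end,
  "Franklin",   "1, Jan, 2017",    "19, Feb, 2017",
  "Franklin",   "5, Jun, 2017",    "4, Feb, 2018"
)

expanded <- data %>%
  mutate(spell = row_number()) %>%                        # hypothetical id: one per enrollment row
  pivot_longer(c(enrollment_start, enrollment_end)) %>%
  mutate(
    value = as.Date(value, format = "%d, %b, %Y"),        # assumes English month abbreviations
    value = as.Date(format(value, "%Y-%m-01"))            # floor to the first of the month
  ) %>%
  group_by(unique_name, spell) %>%                        # complete within each enrollment
  complete(value = seq.Date(min(value), max(value), by = "month")) %>%
  ungroup()

# distinct() guards against counting an enrollment twice when it
# starts and ends in the same month
enrollment_count <- expanded %>%
  distinct(unique_name, spell, value) %>%
  count(value, name = "enrollment_count")
```

With the per-enrollment grouping, March to May 2017 should not appear for Franklin, since no enrollment covers those months.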

Upvotes: 1

zephryl

Reputation: 17204

Create a list column containing the sequence of months between each set of dates, then unnest and count.

Notes:

I use lubridate::floor_date() to round enrollment_start to the first day of the month. Otherwise, seq() may skip months if enrollment_start is on the 29th of the month or later.
The fifth row of your example data has enrollment_start later than enrollment_end -- I assumed this was an error and removed it.

library(tidyverse)
library(lubridate)

enrollments %>% 
  mutate(
    across(c(enrollment_start, enrollment_end), dmy),  # convert to date
    month = map2(
      floor_date(enrollment_start, unit = "month"),    # round to 1st day
      enrollment_end,
      ~ seq(.x, .y, by = "month")
    )
  ) %>% 
  unnest_longer(month) %>% 
  count(month, name = "enrollment_count")

#> # A tibble: 27 x 2
#>    month      enrollment_count
#>    <date>                <int>
#>  1 2017-01-01                2
#>  2 2017-02-01                2
#>  3 2017-03-01                1
#>  4 2017-04-01                2
#>  5 2017-05-01                2
#>  6 2017-06-01                3
#>  7 2017-07-01                3
#>  8 2017-08-01                3
#>  9 2017-09-01                3
#> 10 2017-10-01                3
#> # ... with 17 more rows

Created on 2022-03-25 by the reprex package (v2.0.1)
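The first note above, about flooring the start date, can be seen in a small base R example. The dates here are made up for illustration; the point is only that stepping by month from a day that does not exist in every month rolls over into the following month, so some months never show up in the sequence.

```r
# Hypothetical enrollment that starts on the 31st
start <- as.Date("2017-01-31")
end   <- as.Date("2017-04-30")

# Without flooring: stepping by month from the 31st overflows in short months
unfloored <- seq(start, end, by = "month")

# With flooring: start from the first day of the month instead
floored_start <- as.Date(format(start, "%Y-%m-01"))
floored <- seq(floored_start, end, by = "month")

# Compare which months each sequence covers
unfloored_months <- format(unfloored, "%Y-%m")
floored_months   <- format(floored, "%Y-%m")

unfloored_months
floored_months
```

Comparing the two vectors shows the floored sequence covering every month from January to April, while the unfloored one has no date in February.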

Upvotes: 1
