Reputation: 25
Let's say I have school enrollment data stored in this format, with start date and end date fields:
unique_name | enrollment_start | enrollment_end |
---|---|---|
Amy | 1, Jan, 2017 | 30, Sep, 2018 |
Franklin | 1, Jan, 2017 | 19, Feb, 2017 |
Franklin | 5, Jun, 2017 | 4, Feb, 2018 |
Franklin | 21, Oct, 2018 | 9, Mar, 2019 |
Samir | 1, Jun, 2017 | 4, Feb, 2017 |
Samir | 5, Apr, 2017 | 12, Sep, 2018 |
... | ... | ... |
And I want to produce aggregated counts of enrollment by month like this:
month | enrollment_count |
---|---|
Jan, 2017 | 25 |
Feb, 2017 | 31 |
Mar, 2017 | 19 |
Apr, 2017 | 34 |
May, 2017 | 29 |
Jun, 2017 | 32 |
... | ... |
Is there an easy way to accomplish this with dplyr?
The only approach I can think of is to loop over every month from month_min to month_max and count the rows whose enrollment period overlaps that month. I'm hoping for simpler code.
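For reference, the looping approach described above might look roughly like the base R sketch below. The data frame `enroll`, its three sample rows, and the helper names `first_of_month()` and `count_active()` are hypothetical stand-ins, and the dates are assumed to be already parsed to `Date`.

```r
# Hypothetical sample data, already parsed to Date
enroll <- data.frame(
  unique_name = c("Amy", "Franklin", "Franklin"),
  enrollment_start = as.Date(c("2017-01-01", "2017-01-01", "2017-06-05")),
  enrollment_end = as.Date(c("2018-09-30", "2017-02-19", "2018-02-04"))
)

# First day of the month that a date falls in
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

# Every month from the earliest start to the latest end
month_min <- first_of_month(min(enroll$enrollment_start))
month_max <- first_of_month(max(enroll$enrollment_end))
months <- seq(month_min, month_max, by = "month")

# An enrollment is active in a month if it starts before the next month
# begins and ends on or after the first day of the month
count_active <- function(month_start) {
  next_month <- seq(month_start, by = "month", length.out = 2)[2]
  sum(enroll$enrollment_start < next_month & enroll$enrollment_end >= month_start)
}

# Loop over the month index so the Date class is preserved
monthly <- data.frame(
  month = months,
  enrollment_count = vapply(
    seq_along(months),
    function(i) count_active(months[i]),
    integer(1)
  )
)
head(monthly)
```

This works, but it rescans every row for every month, which is the part I would like to avoid writing by hand.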
Upvotes: 2
Views: 753
Reputation: 2960
I think this can be done pretty elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. And ivs is a package dedicated to working with intervals of data, which is exactly what you have here. Here we assume that if your enrollment start/end fell in a month, then you should be considered active in that month.
library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)
enrollments <- tribble(
  ~unique_name, ~enrollment_start, ~enrollment_end,
  "Amy", "1, Jan, 2017", "30, Sep, 2018",
  "Franklin", "1, Jan, 2017", "19, Feb, 2017",
  "Franklin", "5, Jun, 2017", "4, Feb, 2018",
  "Franklin", "21, Oct, 2018", "9, Mar, 2019",
  "Samir", "1, Jan, 2017", "4, Feb, 2017",
  "Samir", "5, Apr, 2017", "12, Sep, 2018"
)
# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
  mutate(
    start = enrollment_start %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month"),
    end = enrollment_end %>%
      year_month_day_parse(format = "%d, %b, %Y") %>%
      calendar_narrow("month") %>%
      add_months(1),
    .keep = "unused"
  )
enrollments
#> # A tibble: 6 × 3
#> unique_name start end
#> <chr> <ymd<month>> <ymd<month>>
#> 1 Amy 2017-01 2018-10
#> 2 Franklin 2017-01 2017-03
#> 3 Franklin 2017-06 2018-03
#> 4 Franklin 2018-10 2019-04
#> 5 Samir 2017-01 2017-03
#> 6 Samir 2017-04 2018-10
# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
  mutate(active = iv(start, end), .keep = "unused")
enrollments
#> # A tibble: 6 × 2
#> unique_name active
#> <chr> <iv<ymd<month>>>
#> 1 Amy [2017-01, 2018-10)
#> 2 Franklin [2017-01, 2017-03)
#> 3 Franklin [2017-06, 2018-03)
#> 4 Franklin [2018-10, 2019-04)
#> 5 Samir [2017-01, 2017-03)
#> 6 Samir [2017-04, 2018-10)
# We'll generate a sequence of months that will be part of the final result
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L
months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#> month
#> <ymd<month>>
#> 1 2017-01
#> 2 2017-02
#> 3 2017-03
#> 4 2017-04
#> 5 2017-05
#> 6 2017-06
#> 7 2017-07
#> 8 2017-08
#> 9 2017-09
#> 10 2017-10
#> # … with 17 more rows
# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
  mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#> month count
#> <ymd<month>> <int>
#> 1 2017-01 3
#> 2 2017-02 3
#> 3 2017-03 1
#> 4 2017-04 2
#> 5 2017-05 2
#> 6 2017-06 3
#> 7 2017-07 3
#> 8 2017-08 3
#> 9 2017-09 3
#> 10 2017-10 3
#> # … with 17 more rows
Created on 2022-04-05 by the reprex package (v2.0.1)
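To make the half-open interval logic concrete, here is a small base R sketch of roughly the same count without ivs or clock. It assumes months are encoded as an integer index (`year * 12 + month - 1`); the helper `ym()` and the vectors `start` and `end` are illustrative names, not part of either package.

```r
# Hypothetical helper: encode a year-month as a single integer
ym <- function(year, month) year * 12L + (month - 1L)

# The six enrollments from above as half-open [start, end) month intervals
start <- c(
  ym(2017L, 1L), ym(2017L, 1L), ym(2017L, 6L),
  ym(2018L, 10L), ym(2017L, 1L), ym(2017L, 4L)
)
end <- c(
  ym(2018L, 10L), ym(2017L, 3L), ym(2018L, 3L),
  ym(2019L, 4L), ym(2017L, 3L), ym(2018L, 10L)
)

# Every month from the earliest start to the last month covered
# (the interval end is exclusive, hence the - 1L)
months <- seq(min(start), max(end) - 1L)

# A month is counted for an interval when start <= month < end
count <- vapply(months, function(m) sum(start <= m & m < end), integer(1))

head(count)
```

The first six values come out as 3, 3, 1, 2, 2, 3, in line with the `count` column above. The ivs version does the same comparison but keeps the months as proper year-month objects.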
Upvotes: 2
Reputation: 416
Here's my take on this with dplyr and tidyr, using complete().
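library(dplyr)
library(tidyr)
# `enrollments` is the raw data frame from the question (dates still stored as text)
data <- enrollments %>%
  # Tag each row so that separate enrollment spells of one student stay separate;
  # grouping by name alone would also fill the gaps between spells
  mutate(enrollment_id = row_number()) %>%
  pivot_longer(cols = c('enrollment_start', 'enrollment_end')) %>%
  mutate(value = as.Date(value, format = "%d, %B, %Y")) %>%
  # At this point the data looks like:
  #   unique_name enrollment_id name             value
  #   <chr>               <int> <chr>            <date>
  # 1 Amy                     1 enrollment_start 2017-01-01
  # 2 Amy                     1 enrollment_end   2018-09-30
  # 3 Franklin                2 enrollment_start 2017-01-01
  # 4 Franklin                2 enrollment_end   2017-02-19
  # ..etc.
  mutate(value = lubridate::floor_date(value, 'month')) %>%
  # Fill in every month between the start and end of each spell
  group_by(unique_name, enrollment_id) %>%
  complete(value = seq.Date(min(value), max(value), by = "month")) %>%
  arrange(unique_name, value)
# Count each spell at most once per month, even if it starts and ends in the same month
enrollment_count <- data %>%
  ungroup() %>%
  distinct(enrollment_id, value) %>%
  count(value)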
Edit: I forgot to floor the dates in order to properly aggregate per period at the end. Added floor_date from lubridate to do this.
Upvotes: 1
Reputation: 17204
Create a list column containing the sequence of months between each set of dates, then unnest and count.
Notes:
- I use lubridate::floor_date() to round enrollment_start to the first day of the month. Otherwise, seq() may skip months if enrollment_start is on the 29th of the month or later.
- One row in your data (Samir's first) has enrollment_start later than enrollment_end; I assumed this was an error and removed it.
library(tidyverse)
library(lubridate)
enrollments %>%
  mutate(
    across(c(enrollment_start, enrollment_end), dmy), # convert to date
    month = map2(
      floor_date(enrollment_start, unit = "month"), # round to 1st day
      enrollment_end,
      ~ seq(.x, .y, by = "month")
    )
  ) %>%
  unnest_longer(month) %>%
  count(month, name = "enrollment_count")
#> # A tibble: 27 x 2
#> month enrollment_count
#> <date> <int>
#> 1 2017-01-01 2
#> 2 2017-02-01 2
#> 3 2017-03-01 1
#> 4 2017-04-01 2
#> 5 2017-05-01 2
#> 6 2017-06-01 3
#> 7 2017-07-01 3
#> 8 2017-08-01 3
#> 9 2017-09-01 3
#> 10 2017-10-01 3
#> # ... with 17 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
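The note about seq() skipping months is easy to check in base R. This is a small illustration with made-up dates, not part of the pipeline above:

```r
# Starting on the 31st: "one month later" overflows February and lands in March
late_start <- seq(as.Date("2017-01-31"), as.Date("2017-04-30"), by = "month")
format(late_start, "%Y-%m")
#> [1] "2017-01" "2017-03" "2017-03"

# Starting on the 1st (what floor_date() produces) gives one date per month
floored <- seq(as.Date("2017-01-01"), as.Date("2017-04-30"), by = "month")
format(floored, "%Y-%m")
#> [1] "2017-01" "2017-02" "2017-03" "2017-04"
```

With the unfloored start, February is never counted and March is counted twice, which is why the start date is floored before calling seq().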
Upvotes: 1