Mutuelinvestor
Mutuelinvestor

Reputation: 3538

How can you determine the number of events that have occurred over specified number of days using dplyr?

I a have tibble with three columns:

  1. runner - character string representing runners name
  2. race - a numeric value representing the number of races run
  3. date - the date of the race

I would like to add a fourth column, last45d, that represents the number of races that have been run in the last 45 days compared to the date of the current row. My reprex includes sample data and my attempt at producing the new row (i get all NAs).

reprex:

library(tidyverse)
library(lubridate)
library(reprex)

df<-tibble(runner=c("D.Wottle","D.Wottle","D.Wottle","D.Wottle","D.Wottle","D.Wottle","C.Hottle","C.Hottle","C.Hottle","C.Hottle","C.Hottle","C.Hottle","JJ.Watt","JJ.Watt","JJ.Watt","JJ.Watt","JJ.Watt","JJ.Watt"),
           race=c(6,5,4,3,2,1,6,5,4,3,2,1,6,5,4,3,2,1),
           date=c(ymd('20170625'),ymd('20170524'),ymd('20170420'),ymd('20170329'),ymd('20170308'),ymd('20170215'),ymd('20170625'),ymd('20170524'),ymd('20170410'),ymd('20170329'),ymd('20170304'),ymd('20170215'),ymd('20170615'),ymd('20170524'),ymd('20170428'),ymd('20170329'),ymd('20170301'),ymd('20170225')),
           surface=c('T','T','D','T','D','T','T','T','D','T','D','T','T','T','D','T','D','T'),
           distance=c(1400,1400,1600,1400,1500,1400,1400,1400,1600,1400,1500,1400,1400,1400,1600,1400,1500,1400),
           finish=c(1,2,2,1,2,3,2,3,3,2,1,1,3,1,1,3,3,2)

           )

df <- df %>% 
  group_by(runner) %>% 
  mutate(last45 = map_int(date, ~ sum(between(as.numeric(difftime(.x, date, units = "days")), 1e-9, 90)))) %>%
  ungroup()
df
#> # A tibble: 18 x 7
#>    runner    race date       surface distance finish last45
#>    <chr>    <dbl> <date>     <chr>      <dbl>  <dbl>  <int>
#>  1 D.Wottle     6 2017-06-25 T           1400      1      3
#>  2 D.Wottle     5 2017-05-24 T           1400      2      3
#>  3 D.Wottle     4 2017-04-20 D           1600      2      3
#>  4 D.Wottle     3 2017-03-29 T           1400      1      2
#>  5 D.Wottle     2 2017-03-08 D           1500      2      1
#>  6 D.Wottle     1 2017-02-15 T           1400      3      0
#>  7 C.Hottle     6 2017-06-25 T           1400      2      3
#>  8 C.Hottle     5 2017-05-24 T           1400      3      3
#>  9 C.Hottle     4 2017-04-10 D           1600      3      3
#> 10 C.Hottle     3 2017-03-29 T           1400      2      2
#> 11 C.Hottle     2 2017-03-04 D           1500      1      1
#> 12 C.Hottle     1 2017-02-15 T           1400      1      0
#> 13 JJ.Watt      6 2017-06-15 T           1400      3      3
#> 14 JJ.Watt      5 2017-05-24 T           1400      1      4
#> 15 JJ.Watt      4 2017-04-28 D           1600      1      3
#> 16 JJ.Watt      3 2017-03-29 T           1400      3      2
#> 17 JJ.Watt      2 2017-03-01 D           1500      3      1
#> 18 JJ.Watt      1 2017-02-25 T           1400      2      0

Created on 2020-05-13 by the reprex package (v0.3.0)

This is what I'd like the end result to look like:

enter image description here

Upvotes: 3

Views: 86

Answers (1)

r2evans
r2evans

Reputation: 160447

df %>%
  group_by(runner) %>%
  mutate(
    last45 = map_int(date, ~ sum(between(as.numeric(difftime(.x, date, units = "days")), 1e-9, 45)))
    #                ^^^^1                                       ^^^^2
  ) %>%
  ungroup()
# # A tibble: 6 x 4
#   runner    race date       last45
#   <chr>    <dbl> <date>      <int>
# 1 D.Wottle     6 2017-06-25      1
# 2 D.Wottle     5 2017-05-24      1
# 3 D.Wottle     4 2017-04-20      2
# 4 D.Wottle     3 2017-03-29      2
# 5 D.Wottle     2 2017-03-08      1
# 6 D.Wottle     1 2017-02-15      0

Notes:

  • the two references to date are different: "1" (outside the tilde-function) is transferred into .x one at a time, so .x will always be a single date; "2" (inside the tilde-function) is the original column of dates, so will have as many values as the current runner has rows; and

  • I use 1e-9 because if I use 0, then the current day is always considered; by using 1e-9 (or some equally small number), we get the effective (lower,upper] bounds, vice the dplyr::between default of [lower,upper] (both sides closed).

Upvotes: 4

Related Questions