Counting number of rows before the first event and after the last event in another dataframe

Question

I would like to create a new dataframe (reports_before_after_AC_clusters) based on the following two dataframes:

AC events dataframe

tail(AC_clusters)
  Park                  Date        Cluster 
                               
1 Arbour Lake East      2019-07-08       1       
2 Arbour Lake East      2019-07-09       1       
3 Arbour Lake East      2019-07-10       1       
4 Winston-Victoria Park 2021-07-09       1       
5 Winston-Victoria Park 2021-09-10       2       
6 Winston-Victoria Park 2021-09-16       2

Where "Park" is the park where an AC event was conducted, and "cluster" represents a group of AC events where each event is less than 30 days from the previous event.

Report dataframes

tail(Reports_per_park_per_day_2)
  Park                   Date    
                        
1 Arbour Lake East       2019-07-02       
2 Arbour Lake East       2019-07-05      
3 Arbour Lake East       2019-07-20       
4 Winston-Victoria Park  2021-07-02       
5 Winston-Victoria Park  2021-09-08       
6 Winston-Victoria Park  2021-09-22

The parks in this dataframe are the same as in the previous dataframe, but the reports are not classified into clusters.

I would like to create a new dataframe, which would include the Park in which the AC events occured, the cluster of AC events (ex. 1), the date of the first AC event within a cluster (dataframe 1), the date of the last AC event within a cluster (dataframe 1), the number of AC events within that cluster (dataframe 1), the number of reports 2 weeks before the first AC event within that cluster (dataframe 1 and 2), and the number of reports 2 weeks after the last AC event within that cluster (dataframe 1 and 2).

Based on the dataframes provided above, my desired dataframe should look like this:

#reports_before_after_AC_clusters#

Park                   Cluster   Start_date End_date   Number_AC Number_reports_before Number_reports_after
Arbour Lake East       1         2019-07-08 2019-07-10 3         2                     1      
Winston-Victoria Park  1         2021-07-09 2021-07-09 1         1                     0
Winston-Victoria Park  2         2021-09-10 2021-09-16 2         1                     1

Martin Gal · Accepted Answer

You could use

library(lubridate)
library(dplyr)

AC_clusters %>% 
  group_by(Park, Cluster) %>% 
  mutate(Start_date = min(Date), 
            End_date = max(Date)) %>% 
  group_by(Park, Cluster, Start_date, End_date) %>% 
  summarise(Number_AC = n(), .groups = "drop") %>% 
  left_join(Reports_per_park_per_day_2, by = "Park") %>% 
  group_by(Park, Cluster, Start_date, End_date, Number_AC) %>% 
  summarise(Number_reports_before = sum(Date <= Start_date - weeks(2)),
            Number_reports_after  = sum(Date >= End_date + weeks(2)),
            .groups = "drop")

This returns

# A tibble: 3 x 7
  Park                  Cluster Start_date End_date   Number_AC Number_reports_before Number_reports_after
                                                                     
1 Arbour Lake East            1 2019-07-08 2019-07-10         3                     0                    0
2 Winston-Victoria Park       1 2021-07-09 2021-07-09         1                     0                    2
3 Winston-Victoria Park       2 2021-09-10 2021-09-16         2                     1                    0

which is almost your expected output. But there are still some questions:

How do you calculate the Number_reports_before for Winston-Victoria Park in Cluster 1? Why is it 2 instead of 0?
How do you exactly calculate the Number_reports_after?

If we remove the + weeks(2) and - weeks(2) part from the code above, the output changes to

# A tibble: 3 x 7
  Park                  Cluster Start_date End_date   Number_AC Number_reports_before Number_reports_after
                                                                     
1 Arbour Lake East            1 2019-07-08 2019-07-10         3                     2                    1
2 Winston-Victoria Park       1 2021-07-09 2021-07-09         1                     1                    2
3 Winston-Victoria Park       2 2021-09-10 2021-09-16         2                     2                    1

which is close to your expected outcome but still different.

Edit

A small change in the code:

AC_clusters %>% 
  group_by(Park, Cluster) %>% 
  mutate(Start_date = min(Date), 
         End_date = max(Date)) %>% 
  group_by(Park, Cluster, Start_date, End_date) %>% 
  summarise(Number_AC = n(), .groups = "drop") %>% 
  left_join(Reports_per_park_per_day_2, by = "Park") %>% 
  group_by(Park, Cluster, Start_date, End_date, Number_AC) %>% 
  summarise(Number_reports_before = sum(Date <= Start_date & Date >= Start_date - weeks(2)),
            Number_reports_after  = sum(Date >= End_date & Date <= End_date + weeks(2)),
            .groups = "drop")

This returns

# A tibble: 3 x 7
  Park                  Cluster Start_date End_date   Number_AC Number_reports_before Number_reports_after
                                                                     
1 Arbour Lake East            1 2019-07-08 2019-07-10         3                     2                    1
2 Winston-Victoria Park       1 2021-07-09 2021-07-09         1                     1                    0
3 Winston-Victoria Park       2 2021-09-10 2021-09-16         2                     1                    1

Data

AC_clusters <- structure(list(Park = c("Arbour Lake East", "Arbour Lake East", 
"Arbour Lake East", "Winston-Victoria Park", "Winston-Victoria Park", 
"Winston-Victoria Park"), Date = structure(c(18085, 18086, 18087, 
18817, 18880, 18886), class = "Date"), Cluster = c(1, 1, 1, 1, 
2, 2)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

Reports_per_park_per_day_2 <- structure(list(Park = c("Arbour Lake East", "Arbour Lake East", 
"Arbour Lake East", "Winston-Victoria Park", "Winston-Victoria Park", 
"Winston-Victoria Park"), Date = structure(c(18079, 18082, 18097, 
18810, 18878, 18892), class = "Date")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

Counting number of rows before the first event and after the last event in another dataframe

Answers (1)

Edit

Data

Related Questions