Reputation: 129
I would like to create a new dataframe (reports_before_after_AC_clusters) based on the following two dataframes:
tail(AC_clusters)
Park Date Cluster
<chr> <date> <dbl>
1 Arbour Lake East 2019-07-08 1
2 Arbour Lake East 2019-07-09 1
3 Arbour Lake East 2019-07-10 1
4 Winston-Victoria Park 2021-07-09 1
5 Winston-Victoria Park 2021-09-10 2
6 Winston-Victoria Park 2021-09-16 2
Where "Park" is the park where an AC event was conducted, and "cluster" represents a group of AC events where each event is less than 30 days from the previous event.
tail(Reports_per_park_per_day_2)
Park Date
<chr> <date>
1 Arbour Lake East 2019-07-02
2 Arbour Lake East 2019-07-05
3 Arbour Lake East 2019-07-20
4 Winston-Victoria Park 2021-07-02
5 Winston-Victoria Park 2021-09-08
6 Winston-Victoria Park 2021-09-22
The parks in this dataframe are the same as in the previous dataframe, but the reports are not classified into clusters.
I would like to create a new dataframe, which would include the Park in which the AC events occured, the cluster of AC events (ex. 1), the date of the first AC event within a cluster (dataframe 1), the date of the last AC event within a cluster (dataframe 1), the number of AC events within that cluster (dataframe 1), the number of reports 2 weeks before the first AC event within that cluster (dataframe 1 and 2), and the number of reports 2 weeks after the last AC event within that cluster (dataframe 1 and 2).
Based on the dataframes provided above, my desired dataframe should look like this:
#reports_before_after_AC_clusters#
Park Cluster Start_date End_date Number_AC Number_reports_before Number_reports_after
Arbour Lake East 1 2019-07-08 2019-07-10 3 2 1
Winston-Victoria Park 1 2021-07-09 2021-07-09 1 1 0
Winston-Victoria Park 2 2021-09-10 2021-09-16 2 1 1
Upvotes: 1
Views: 133
Reputation: 16978
You could use
library(lubridate)
library(dplyr)
AC_clusters %>%
group_by(Park, Cluster) %>%
mutate(Start_date = min(Date),
End_date = max(Date)) %>%
group_by(Park, Cluster, Start_date, End_date) %>%
summarise(Number_AC = n(), .groups = "drop") %>%
left_join(Reports_per_park_per_day_2, by = "Park") %>%
group_by(Park, Cluster, Start_date, End_date, Number_AC) %>%
summarise(Number_reports_before = sum(Date <= Start_date - weeks(2)),
Number_reports_after = sum(Date >= End_date + weeks(2)),
.groups = "drop")
This returns
# A tibble: 3 x 7
Park Cluster Start_date End_date Number_AC Number_reports_before Number_reports_after
<chr> <dbl> <date> <date> <int> <int> <int>
1 Arbour Lake East 1 2019-07-08 2019-07-10 3 0 0
2 Winston-Victoria Park 1 2021-07-09 2021-07-09 1 0 2
3 Winston-Victoria Park 2 2021-09-10 2021-09-16 2 1 0
which is almost your expected output. But there are still some questions:
If we remove the + weeks(2)
and - weeks(2)
part from the code above, the output changes to
# A tibble: 3 x 7
Park Cluster Start_date End_date Number_AC Number_reports_before Number_reports_after
<chr> <dbl> <date> <date> <int> <int> <int>
1 Arbour Lake East 1 2019-07-08 2019-07-10 3 2 1
2 Winston-Victoria Park 1 2021-07-09 2021-07-09 1 1 2
3 Winston-Victoria Park 2 2021-09-10 2021-09-16 2 2 1
which is close to your expected outcome but still different.
A small change in the code:
AC_clusters %>%
group_by(Park, Cluster) %>%
mutate(Start_date = min(Date),
End_date = max(Date)) %>%
group_by(Park, Cluster, Start_date, End_date) %>%
summarise(Number_AC = n(), .groups = "drop") %>%
left_join(Reports_per_park_per_day_2, by = "Park") %>%
group_by(Park, Cluster, Start_date, End_date, Number_AC) %>%
summarise(Number_reports_before = sum(Date <= Start_date & Date >= Start_date - weeks(2)),
Number_reports_after = sum(Date >= End_date & Date <= End_date + weeks(2)),
.groups = "drop")
This returns
# A tibble: 3 x 7
Park Cluster Start_date End_date Number_AC Number_reports_before Number_reports_after
<chr> <dbl> <date> <date> <int> <int> <int>
1 Arbour Lake East 1 2019-07-08 2019-07-10 3 2 1
2 Winston-Victoria Park 1 2021-07-09 2021-07-09 1 1 0
3 Winston-Victoria Park 2 2021-09-10 2021-09-16 2 1 1
AC_clusters <- structure(list(Park = c("Arbour Lake East", "Arbour Lake East",
"Arbour Lake East", "Winston-Victoria Park", "Winston-Victoria Park",
"Winston-Victoria Park"), Date = structure(c(18085, 18086, 18087,
18817, 18880, 18886), class = "Date"), Cluster = c(1, 1, 1, 1,
2, 2)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
Reports_per_park_per_day_2 <- structure(list(Park = c("Arbour Lake East", "Arbour Lake East",
"Arbour Lake East", "Winston-Victoria Park", "Winston-Victoria Park",
"Winston-Victoria Park"), Date = structure(c(18079, 18082, 18097,
18810, 18878, 18892), class = "Date")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 1