Reputation: 2977
I'm coming from SQL and struggling to understand how R's group_by works. Reading the documentation that it simply "changes how it acts with the other dplyr verbs" does not explain anything. I'm specifically confused on how it interacts with the aggregate function max
in the following snippet:
df <- db %>%
tbl("data_table") %>%
group_by(site_id) %>%
# get most recent started
filter(start_date == max(start_date, na.rm=T),
end_date == max(end_date, na.rm=T)) %>%
rename(field_name = name) %>%
collect()
I'm translating this into SQL, so I'm having to do a sub-query to get the max start/end dates with the group by and then join that to a general query on data_table
to get the field_name
.
How exactly does group_by interact with other dplyr verbs, or max in this instance?
Upvotes: 0
Views: 51
Reputation: 162
Here, group_by
is acting similarly to how PARTITION
works in SQL. It's making "groups", which are equivalent to partitions, and then it does the calculation (like the max) within the partitions.
Here's how it would look in SQL:
SELECT * FROM (
SELECT site_id,
start_date,
row_number() OVER(PARTITION BY site_id ORDER BY start_date DESC) r_start
row_number() OVER(PARTITION BY site_id ORDER BY end_date DESC) r_end
FROM data_table
) where r_start = 1 and r_end = 1
In that dplyr code, you're saying "partition by site_id, and then filter to where the start_date is the max start_date AND end_date is the max end_date WITHIN each partition"
Welcome to dplyr land :)
Upvotes: 2