simplycoding
simplycoding

Reputation: 2977

How does R's group_by exactly interact with other dplyr verbs?

I'm coming from SQL and struggling to understand how R's group_by works. Reading the documentation that it simply "changes how it acts with the other dplyr verbs" does not explain anything. I'm specifically confused on how it interacts with the aggregate function max in the following snippet:

df <- db %>%
  tbl("data_table") %>% 
  group_by(site_id) %>%
  # get most recent started
  filter(start_date == max(start_date, na.rm=T), 
         end_date == max(end_date, na.rm=T)) %>%
  rename(field_name = name) %>%
  collect()

I'm translating this into SQL, so I'm having to do a sub-query to get the max start/end dates with the group by and then join that to a general query on data_table to get the field_name.

How exactly does group_by interact with other dplyr verbs, or max in this instance?

Upvotes: 0

Views: 51

Answers (1)

redarah
redarah

Reputation: 162

Here, group_by is acting similarly to how PARTITION works in SQL. It's making "groups", which are equivalent to partitions, and then it does the calculation (like the max) within the partitions.

Here's how it would look in SQL:

SELECT * FROM (
  SELECT site_id, 
     start_date, 
     row_number() OVER(PARTITION BY site_id ORDER BY start_date DESC) r_start
     row_number() OVER(PARTITION BY site_id ORDER BY end_date DESC) r_end
  FROM data_table
) where r_start = 1 and r_end = 1

In that dplyr code, you're saying "partition by site_id, and then filter to where the start_date is the max start_date AND end_date is the max end_date WITHIN each partition"

Welcome to dplyr land :)

Upvotes: 2

Related Questions