R: Combine duplicate entries and conditionally select dates by group

Question

I have a dataset containing hundreds of firms in which employees are grouped by firm id. For some firms there are multiple entries for the same employee, albeit with different start and stop dates.

I want to combine or remove the duplicate employee entries, while keeping both the earlier of the two start dates and later of the two end dates. My dataset looks like this:

df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  employee = c("culver", "maguire", "florenzano","cretu", "tran", "ryman",
  "menezes", "dancause", "schumaker", "tyler", "cretu", "tran", "menezes"),
  started = structure(c(15014, 15014, 15014, 15279, 15279, 15279, 15279, 15279, 15279, 15279, 15706, 15492, 15706), class = "Date"),
  ended = structure(c(18157, 15126, 15126, 15949, 15949, 15461, 15705, 15461, 15461, 15584, 18157,
  15706, 15876), class = "Date")), row.names = c(NA, -13L), class = c("tbl_df","tbl", "data.frame"), .Names = c("id", "employee", "started","ended"))

You can see that Firm 2 has duplicate entries for Cretu, Tran, and Menezes. The final dataset should look like this:

df2 <- structure(list(id = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
  employee = c("culver", "maguire", "florenzano","cretu", "tran", "ryman",
  "menezes", "dancause", "schumaker", "tyler"),
  started = structure(c(15014, 15014, 15014, 15279, 15279, 15279, 15279, 15279, 15279, 15279), class = "Date"),
  ended = structure(c(18157, 15126, 15126, 18157, 15949, 15461, 15876, 15461, 15461, 15584), class = "Date")), row.names = c(NA, -13L), class = c("tbl_df","tbl", "data.frame"), .Names = c("id", "employee", "started","ended"))

I've tried a number of solutions involving mutate, which.min and which.max with no success. There should be a tidy solution available here but I can't figure it out. Any help would be much appreciated.

slava-kohut · Accepted Answer

library(dplyr)
df %>% group_by(id, employee) %>% 
  summarise(started = min(started), ended = max(ended)) %>% 
  ungroup()

R: Combine duplicate entries and conditionally select dates by group

Answers (2)

Related Questions