gwcd7p
gwcd7p

Reputation: 5

R: Combine duplicate entries and conditionally select dates by group

I have a dataset containing hundreds of firms in which employees are grouped by firm id. For some firms there are multiple entries for the same employee, albeit with different start and stop dates.

I want to combine or remove the duplicate employee entries, while keeping both the earlier of the two start dates and later of the two end dates. My dataset looks like this:

df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
  employee = c("culver", "maguire", "florenzano","cretu", "tran", "ryman",
  "menezes", "dancause", "schumaker", "tyler", "cretu", "tran", "menezes"),
  started = structure(c(15014, 15014, 15014, 15279, 15279, 15279, 15279, 15279, 15279, 15279, 15706, 15492, 15706), class = "Date"),
  ended = structure(c(18157, 15126, 15126, 15949, 15949, 15461, 15705, 15461, 15461, 15584, 18157,
  15706, 15876), class = "Date")), row.names = c(NA, -13L), class = c("tbl_df","tbl", "data.frame"), .Names = c("id", "employee", "started","ended"))

You can see that Firm 2 has duplicate entries for Cretu, Tran, and Menezes. The final dataset should look like this:

df2 <- structure(list(id = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
  employee = c("culver", "maguire", "florenzano","cretu", "tran", "ryman",
  "menezes", "dancause", "schumaker", "tyler"),
  started = structure(c(15014, 15014, 15014, 15279, 15279, 15279, 15279, 15279, 15279, 15279), class = "Date"),
  ended = structure(c(18157, 15126, 15126, 18157, 15949, 15461, 15876, 15461, 15461, 15584), class = "Date")), row.names = c(NA, -13L), class = c("tbl_df","tbl", "data.frame"), .Names = c("id", "employee", "started","ended"))

I've tried a number of solutions involving mutate, which.min and which.max with no success. There should be a tidy solution available here but I can't figure it out. Any help would be much appreciated.

Upvotes: 0

Views: 59

Answers (2)

gwcd7p
gwcd7p

Reputation: 5

That did it. The initial code from @slava-kohut with @IceCreamToucan's suggestion returned the correct results. Thank you both for the help.

Upvotes: 0

slava-kohut
slava-kohut

Reputation: 4233

library(dplyr)
df %>% group_by(id, employee) %>% 
  summarise(started = min(started), ended = max(ended)) %>% 
  ungroup()

Upvotes: 1

Related Questions