Reputation: 5
I have a dataset containing hundreds of firms in which employees are grouped by firm id
. For some firms there are multiple entries for the same employee
, albeit with different start
and stop
dates.
I want to combine or remove the duplicate employee entries, while keeping both the earlier of the two start dates and later of the two end dates. My dataset looks like this:
df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
employee = c("culver", "maguire", "florenzano","cretu", "tran", "ryman",
"menezes", "dancause", "schumaker", "tyler", "cretu", "tran", "menezes"),
started = structure(c(15014, 15014, 15014, 15279, 15279, 15279, 15279, 15279, 15279, 15279, 15706, 15492, 15706), class = "Date"),
ended = structure(c(18157, 15126, 15126, 15949, 15949, 15461, 15705, 15461, 15461, 15584, 18157,
15706, 15876), class = "Date")), row.names = c(NA, -13L), class = c("tbl_df","tbl", "data.frame"), .Names = c("id", "employee", "started","ended"))
You can see that Firm 2 has duplicate entries for Cretu, Tran, and Menezes. The final dataset should look like this:
df2 <- structure(list(id = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
employee = c("culver", "maguire", "florenzano","cretu", "tran", "ryman",
"menezes", "dancause", "schumaker", "tyler"),
started = structure(c(15014, 15014, 15014, 15279, 15279, 15279, 15279, 15279, 15279, 15279), class = "Date"),
ended = structure(c(18157, 15126, 15126, 18157, 15949, 15461, 15876, 15461, 15461, 15584), class = "Date")), row.names = c(NA, -13L), class = c("tbl_df","tbl", "data.frame"), .Names = c("id", "employee", "started","ended"))
I've tried a number of solutions involving mutate
, which.min
and which.max
with no success. There should be a tidy solution available here but I can't figure it out. Any help would be much appreciated.
Upvotes: 0
Views: 59
Reputation: 5
That did it. The initial code from @slava-kohut with @IceCreamToucan's suggestion returned the correct results. Thank you both for the help.
Upvotes: 0
Reputation: 4233
library(dplyr)
df %>% group_by(id, employee) %>%
summarise(started = min(started), ended = max(ended)) %>%
ungroup()
Upvotes: 1