Remove duplicate rows based on two columns keeping the row that has larger value than other column

Question

I have a data frame ordered desc on the calving column as below.

   Served     Calved      ProfileID
1  2015-07-29 2017-05-07  1346
2  2015-07-29 2017-05-06  2645
3  2016-06-12 2017-05-05  3687
4  2016-05-19 2017-05-05  3687
5  2015-05-21 2017-05-05  3687
6  2013-05-08 2017-05-05  3687
7  2015-08-08 2016-05-04  4235
8  2015-06-14 2016-05-04  4235
9  2015-05-31 2016-05-04  4235
10 2013-08-13 2014-05-02  5425
11 2013-07-23 2014-05-02  5425
12 2012-03-01 2014-05-02  5425
13 2017-07-11 2013-04-22  5425
14 2012-11-01 2013-04-22  5425
15 2015-12-23 2013-04-22  5425
16 2014-05-10 2013-04-22  5425

I would like to remove the duplicates from the Calved column keeping one observance from the ProfileID column depending on the date in the calved column like so

   Served     Calved      ProfileID
1  2015-07-29 2017-05-07  1346
2  2015-07-29 2017-05-06  2645
3  2016-06-12 2017-05-05  3687
7  2015-08-08 2016-05-04  4235
10 2013-08-13 2014-05-02  5425
13 2017-07-11 2013-04-22  5425

I achieved this using

on_served_profileID<-master_arranged[!duplicated(master_arranged[c("Calved","ProfileID")]),]

I would like to add and and condition so that the row which is selected from Calved column is less than the served column and not just the first occurrence in each date.

For line 13 of the output I would rather this be line 14 because the served column is less than the calved column like so, rather than giving me the first observance for each date in the Calved column.

   Served     Calved      ProfileID
1  2015-07-29 2017-05-07  1346
2  2015-07-29 2017-05-06  2645
3  2016-06-12 2017-05-05  3687
7  2015-08-08 2016-05-04  4235
10 2015-08-13 2014-05-02  5425
14 2012-11-01 2013-04-22  5425

I have tried, and variations of:

on_served_profileID<-master_arranged[!duplicated(master_arranged[c("Calved","ProfileID")])& master_arranged$Served < master_arranged$Calved,]

This is to try and select the calved observance which is less than the served observance hence the & condition "$served < $Calved"

Prem · Accepted Answer

library(dplyr)

df$Served <- as.Date(df$Served)
df$Calved <- as.Date(df$Calved)

df %>%
  group_by(Calved, ProfileID) %>%
  summarise(Served = Served[first(which(Served < Calved))]) %>%
  arrange(desc(Calved))

Output is:

      Calved ProfileID     Served
1 2017-05-07      1346 2015-07-29
2 2017-05-06      2645 2015-07-29
3 2017-05-05      3687 2016-06-12
4 2016-05-04      4235 2015-08-08
5 2014-05-02      5425 2013-08-13
6 2013-04-22      5425 2012-11-01

Sample data:

df <- structure(list(Served = c("2015-07-29", "2015-07-29", "2016-06-12", 
"2016-05-19", "2015-05-21", "2013-05-08", "2015-08-08", "2015-06-14", 
"2015-05-31", "2013-08-13", "2013-07-23", "2012-03-01", "2017-07-11", 
"2012-11-01", "2015-12-23", "2014-05-10"), Calved = c("2017-05-07", 
"2017-05-06", "2017-05-05", "2017-05-05", "2017-05-05", "2017-05-05", 
"2016-05-04", "2016-05-04", "2016-05-04", "2014-05-02", "2014-05-02", 
"2014-05-02", "2013-04-22", "2013-04-22", "2013-04-22", "2013-04-22"
), ProfileID = c(1346L, 2645L, 3687L, 3687L, 3687L, 3687L, 4235L, 
4235L, 4235L, 5425L, 5425L, 5425L, 5425L, 5425L, 5425L, 5425L
)), .Names = c("Served", "Calved", "ProfileID"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16"))

Remove duplicate rows based on two columns keeping the row that has larger value than other column

Answers (1)

Related Questions