J.Gorman
J.Gorman

Reputation: 5

Removing duplicates based on 3 columns in R

I have a data set of 300k+ cases and where a customer id may be repeated several times. Each customer has a date and rank associated with it as well. I'd like to be able to keep only unique customer ids sorted first by date then if there is a duplicate id with a duplicate date it would sort by rank (keeping the rank closest to 1). An example of my data is like this:

Customer.ID  Date     Rank
576293     8/13/2012    2
576293     11/16/2015   6
581252     11/22/2013   4
581252     11/16/2011   6
581252     1/4/2016     5
581600     1/12/2015    3
581600     1/12/2015    2
582560     4/13/2016    1
591674     3/21/2012    6
586334     3/30/2014    1

Ideal outcome would then be like this:

Customer.ID  Date     Rank
576293     11/16/2015   6
581252     1/4/2016     5
581600     1/12/2015    2
582560     4/13/2016    1
591674     3/21/2012    6
586334     3/30/2014    1

Upvotes: 0

Views: 1198

Answers (1)

aichao
aichao

Reputation: 7435

With the desired output of the OP clarified:

We can also do this with base R, which will be faster than the below dplyr approach using group_by(Customer.ID) since we are not going to have to loop over all unique Customer.ID:

df <- df[order(-df$Customer.ID,as.Date(df$Date, format="%m/%d/%Y"),-df$Rank, decreasing=TRUE),]
res <- df[!duplicated(df$Customer.ID),]

Notes:

  1. First, sort by Customer.ID in ascending order followed by Date in descending order followed by Rank in ascending order.
  2. Remove the duplicates in Customer.ID so that only the first row for each Customer.ID is kept.

The result using your posted data as a data frame df (without converting the Date column) in ascending order for Customer.ID:

print(res)
##   Customer.ID       Date Rank
##2       576293 11/16/2015    6
##5       581252   1/4/2016    5
##7       581600  1/12/2015    2
##8       582560  4/13/2016    1
##10      586334  3/30/2014    1
##9       591674  3/21/2012    6

Data:

df <- structure(list(Customer.ID = c(591674L, 586334L, 582560L, 581600L, 
581252L, 576293L), Date = c("3/21/2012", "3/30/2014", "4/13/2016", 
"1/12/2015", "1/4/2016", "11/16/2015"), Rank = c(6L, 1L, 1L, 
2L, 5L, 6L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(9L, 
10L, 8L, 7L, 5L, 2L), class = "data.frame")

If you want to keep only the latest date (followed by lower rank) row for each Customer.ID, you can do the following using dplyr:

library(dplyr)
res <- df %>% group_by(Customer.ID) %>% arrange(desc(Date),Rank) %>% 
              summarise_all(funs(first)) %>%
              ungroup() %>% arrange(Customer.ID)

Notes:

  1. group_by Customer.ID and sort using arrange by Date in descending order and Rank by ascending order.
  2. summarise_all to keep only the first row from each Customer.ID.
  3. Finally, ungroup and sort by Customer.ID to get your desired result.

Using your data as a data frame df with the Date column converted to the Date class:

print(res)
### A tibble: 7 x 3
##  Customer.ID       Date  Rank
##        <int>     <date> <int>
##1      576293 2015-11-16     6
##2      581252 2016-01-04     5
##3      581600 2015-01-12     2
##4      582560 2016-04-13     1
##5      586334 2014-03-30     1
##6      591674 2012-03-21     6

Data:

df <- structure(list(Customer.ID = c(576293L, 576293L, 581252L, 581252L, 
581252L, 581600L, 581600L, 582560L, 591674L, 586334L), Date = structure(c(15565, 
16755, 16031, 15294, 16804, 16447, 16447, 16904, 15420, 16159
), class = "Date"), Rank = c(2L, 6L, 4L, 6L, 5L, 3L, 2L, 1L, 
6L, 1L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(NA, 
-10L), class = "data.frame")

Upvotes: 1

Related Questions