user9292
user9292

Reputation: 1145

Remove duplicates with old dates

Let’s say I’ve two variables: ID and date:

ID <- c(1,1,1,2,2,2,3,3,3,4,4,4)
Datestr <- c("01/05/2014", "01/16/2014", "01/08/2014","07/05/2014", "07/01/2014", "07/02/2014", "02/05/2014", "02/11/2014", "02/02/2014","01/01/2014", "01/11/2014", "01/03/2014")
dates <- as.Date(Datestr, "%m/%d/%Y")
Mydata <- data.frame(ID, dates)


    ID  dates
1   1   2014-01-05
2   1   2014-01-16
3   1   2014-01-08
4   2   2014-07-05
5   2   2014-07-01
6   2   2014-07-02
7   3   2014-02-05
8   3   2014-02-11
9   3   2014-02-02
10  4   2014-01-01
11  4   2014-01-11
12  4   2014-01-03

Now, I need to remove duplicates and keep the ID with the recent date.

    ID  dates
1   1   2014-01-05
5   2   2014-07-01
9   3   2014-02-02
10  4   2014-01-01

Upvotes: 1

Views: 85

Answers (3)

Rich Scriven
Rich Scriven

Reputation: 99331

If you want only the first date for each ID, you can use min

Using dplyr

library(dplyr)
group_by(Mydata, ID) %>% summarise(dates = min(dates))   
#   ID      dates
# 1  1 2014-01-05
# 2  2 2014-07-01
# 3  3 2014-02-02
# 4  4 2014-01-01

Or data.table

library(data.table)
as.data.table(Mydata)[, .(dates = min(dates)), by = ID][]
#    ID      dates
# 1:  1 2014-01-05
# 2:  2 2014-07-01
# 3:  3 2014-02-02
# 4:  4 2014-01-01

Upvotes: 4

David Arenburg
David Arenburg

Reputation: 92282

Using duplicated would be the most efficient method IMO

Mydata <- Mydata[order(Mydata$dates), ]
Mydata[!duplicated(Mydata$ID), ]
#    ID      dates
# 10  4 2014-01-01
# 1   1 2014-01-05
# 9   3 2014-02-02
# 5   2 2014-07-01

Or using data.tables for additional efficiency gain

library(data.table)
unique(setorder(setDT(Mydata), dates), by = "ID")
#    ID      dates
# 1:  4 2014-01-01
# 2:  1 2014-01-05
# 3:  3 2014-02-02
# 4:  2 2014-07-01

Or duplicated

setorder(setDT(Mydata), dates)[!duplicated(ID)]
#    ID      dates
# 1:  4 2014-01-01
# 2:  1 2014-01-05
# 3:  3 2014-02-02
# 4:  2 2014-07-01

Upvotes: 3

talat
talat

Reputation: 70256

You can use aggregate:

aggregate(dates ~ ID, Mydata, min)
  ID      dates
1  1 2014-01-05
2  2 2014-07-01
3  3 2014-02-02
4  4 2014-01-01

Or

library(dplyr)
group_by(Mydata, ID) %>% filter(min_rank(dates) == 1L)
#Source: local data frame [4 x 2]
#Groups: ID
#
#  ID      dates
#1  1 2014-01-05
#2  2 2014-07-01
#3  3 2014-02-02
#4  4 2014-01-01

Or

group_by(Mydata, ID) %>% slice(which.min(dates))
#Source: local data frame [4 x 2]
#Groups: ID
#
#  ID      dates
#1  1 2014-01-05
#2  2 2014-07-01
#3  3 2014-02-02
#4  4 2014-01-01

Or

group_by(Mydata, ID) %>% arrange(dates) %>% slice(1)
#Source: local data frame [4 x 2]
#Groups: ID
#
#  ID      dates
#1  1 2014-01-05
#2  2 2014-07-01
#3  3 2014-02-02
#4  4 2014-01-01

And a data.table option:

library(data.table)
setDT(Mydata)[,.SD[which.min(dates)], ID]
#   ID      dates
#1:  1 2014-01-05
#2:  2 2014-07-01
#3:  3 2014-02-02
#4:  4 2014-01-01

Upvotes: 4

Related Questions