Simon
Simon

Reputation: 10150

R remove all but the first occurence

I have a dataframe that looks like this:

  id     date  time
1  1 15-12-15 10:30
2  2 15-12-15 08:30
3  3 16-12-15 10:30
4  1 16-12-15 08:30

Which can be created like this:

df <- data.frame(id=c(1,2,3,1),
                 date=c("15-12-15", "15-12-15", "16-12-15", "16-12-15"),
                 time=c("10:30", "08:30", "10:30", "08:30"))

I want to remove all rows that have a duplicate id. If duplicates are found, then I want to retain the id that occurred first (by date and time), and remove all others

The way I'm currently doing it is sorting by date, and then:

df <- subset(df, duplicated(df$id) == FALSE)

So that will retain all rows where id isn't duplicated, but the retention criteria isn't clear. It seems to just keep the first occurence that it encounters when sorted by date?

Is there a way to do this without having to date sort first?

Upvotes: 0

Views: 670

Answers (4)

Aramis7d
Aramis7d

Reputation: 2496

One alternative using dplyr can be:

df %>% 
  group_by(id) %>%
  arrange(date, time) %>%
  top_n(1)

which gives:

     id     date   time
  <dbl>   <fctr> <fctr>
1     2 15-12-15  08:30
2     1 15-12-15  10:30
3     3 16-12-15  10:30

Upvotes: 1

Uwe
Uwe

Reputation: 42554

data.table includes an enhanced unique() function which accepts a by parameter to specify the columns to be used for the uniqueness check:

library(data.table)
unique(setDT(df, key = c("date", "time")), by = "id")
   id     date  time
1:  2 15-12-15 08:30
2:  1 15-12-15 10:30
3:  3 16-12-15 10:30

setDT() coerces df to data.table class. The key parameter sorts df and marks it as sorted.

Upvotes: 1

akrun
akrun

Reputation: 887241

We need the fromLast = TRUE

df[!duplicated(df$id, fromLast = TRUE),]
#    id     date  time
#2  2 15-12-15 08:30
#3  3 16-12-15 10:30
#4  1 16-12-15 08:30

If we need to consider the 'date' and 'time' (which is already ordered in the example)

library(lubridate)
df[!duplicated(df$id[with(df, order(id, ymd_hm(paste(date, time))))], fromLast = TRUE),]
#    id     date  time
#2  2 15-12-15 08:30
#3  3 16-12-15 10:30
#4  1 16-12-15 08:30

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389055

You probably need something like this :

library(dplyr)
df %>%
   group_by(id) %>%
   mutate(datetime = as.POSIXct(paste(date, time), format = "%d-%m-%y %H:%M")) %>%
   arrange(datetime) %>%
   slice(1) %>%
   select(-datetime)


#     id     date   time           
#  <dbl>   <fctr> <fctr>           
#1     1 15-12-15  10:30 
#2     2 15-12-15  08:30 
#3     3 16-12-15  10:30 

We create a POSIXct object pasting the date and time column together, arrange them accordingly and select the first observation (one which occur earliest) from each group.

Upvotes: 2

Related Questions