Reputation: 10150
I have a dataframe that looks like this:
id date time
1 1 15-12-15 10:30
2 2 15-12-15 08:30
3 3 16-12-15 10:30
4 1 16-12-15 08:30
Which can be created like this:
df <- data.frame(id=c(1,2,3,1),
date=c("15-12-15", "15-12-15", "16-12-15", "16-12-15"),
time=c("10:30", "08:30", "10:30", "08:30"))
I want to remove all rows that have a duplicate id
. If duplicates are found, then I want to retain the id
that occurred first (by date and time), and remove all others
The way I'm currently doing it is sorting by date
, and then:
df <- subset(df, duplicated(df$id) == FALSE)
So that will retain all rows where id
isn't duplicated, but the retention criteria isn't clear. It seems to just keep the first occurence that it encounters when sorted by date?
Is there a way to do this without having to date sort first?
Upvotes: 0
Views: 670
Reputation: 2496
One alternative using dplyr
can be:
df %>%
group_by(id) %>%
arrange(date, time) %>%
top_n(1)
which gives:
id date time
<dbl> <fctr> <fctr>
1 2 15-12-15 08:30
2 1 15-12-15 10:30
3 3 16-12-15 10:30
Upvotes: 1
Reputation: 42554
data.table
includes an enhanced unique()
function which accepts a by
parameter to specify the columns to be used for the uniqueness check:
library(data.table)
unique(setDT(df, key = c("date", "time")), by = "id")
id date time 1: 2 15-12-15 08:30 2: 1 15-12-15 10:30 3: 3 16-12-15 10:30
setDT()
coerces df
to data.table
class. The key
parameter sorts df
and marks it as sorted.
Upvotes: 1
Reputation: 887241
We need the fromLast = TRUE
df[!duplicated(df$id, fromLast = TRUE),]
# id date time
#2 2 15-12-15 08:30
#3 3 16-12-15 10:30
#4 1 16-12-15 08:30
If we need to consider the 'date' and 'time' (which is already ordered in the example)
library(lubridate)
df[!duplicated(df$id[with(df, order(id, ymd_hm(paste(date, time))))], fromLast = TRUE),]
# id date time
#2 2 15-12-15 08:30
#3 3 16-12-15 10:30
#4 1 16-12-15 08:30
Upvotes: 1
Reputation: 389055
You probably need something like this :
library(dplyr)
df %>%
group_by(id) %>%
mutate(datetime = as.POSIXct(paste(date, time), format = "%d-%m-%y %H:%M")) %>%
arrange(datetime) %>%
slice(1) %>%
select(-datetime)
# id date time
# <dbl> <fctr> <fctr>
#1 1 15-12-15 10:30
#2 2 15-12-15 08:30
#3 3 16-12-15 10:30
We create a POSIXct
object pasting the date
and time
column together, arrange
them accordingly and select the first observation (one which occur earliest) from each group.
Upvotes: 2