Reputation: 423
I am dealing with a dataset which looks like this below
Id Time Location
754005 13:19:00 HK564
754005 13:19:00 IE578
980278 19:51:00 AK177
980278 21:15:00 JB237
The Id
column contains duplicate values, Time
may or may not contain duplicates, and the Location
column contains unique values.
I am trying to filter this dataset where the Id
values are unique based on this logic:
1) If the time values are the same for that Id
, then retain the last row for that Id
. In other words, id 754005
is repeated twice, the times are the same but Location
is different so retaining the last row means retaining this row
754005 13:19:00 IE578
2) If the time values are different for that Id
, then retain the row with the latest time value for that Id
. In other words, id 980278
has two different time values, 19:51:00
, 21:15:00
, retain the row with time 21:15:00
for this id because this is the latest time.
The final dataset should looks like this below
Id Time Location
754005 13:19:00 IE578
980278 21:15:00 JB237
Upvotes: 1
Views: 69
Reputation: 42544
This can be achieved by ordering the dataset by Time
and by picking the last observation in each group. With data.table
this becomes "one-liner":
library(data.table)
setDT(DF)[order(Time), .SD[.N], by = Id]
Id Time Location 1: 754005 13:19:00 IE578 2: 980278 21:15:00 JB237
Alternative, the tail()
function can be used to pick the last observation in each group:
setDT(DF)[order(Time), tail(.SD, 1), by = Id]
This solutions needs less lines of code than the other answers posted so far, especially the data.table
code by digEmAll. So, I feel this deserves a detailed explanation:
setDT(DF)
coerces DF
into a data.table object by reference, i.e., without copying while as.data.table(DF)
creates a copy.order(Time)
orders Time
lexicographically. There is no need for converting the time string into something else for the purpose of sorting. Here, Time
is a factor whose levels were ordered lexicographically while reading the dataset.help("order")
, any unresolved ties will be left in their original ordering. So, the order of rows for Id == 754005
, e.g., will not be changed..SD
is the subset of rows for each group. .N
is the number of rows in each group. So, .SD[.N]
picks the last row of each group. library(data.table)
DF <- fread("
Id Time Location
754005 13:19:00 HK564
754005 13:19:00 IE578
980278 19:51:00 AK177
980278 21:15:00 JB237",
data.table = FALSE, stringsAsFactors = TRUE)
DF
is a data.frame with character columns turned into factors assuming the "worst case".
str(DF)
'data.frame': 4 obs. of 3 variables: $ Id : int 754005 754005 980278 980278 $ Time : Factor w/ 3 levels "13:19:00","19:51:00",..: 1 1 2 3 $ Location: Factor w/ 4 levels "AK177","HK564",..: 2 3 1 4
Upvotes: 1
Reputation: 57210
A possible approach using base R :
# recreate your input
DF <- read.table(text=
'Id Time Location
754005 13:19:00 HK564
754005 13:19:00 IE578
980278 19:51:00 AK177
980278 21:15:00 JB237',header=TRUE)
# convert time strings to datetimes
dates <- strptime(DF$Time,format='%H:%M:%S',tz='GMT')
# get the desired rows
DF2 <-
DF[unique(
ave(seq_len(nrow(DF)),DF$Id,
FUN=function(i)i[order(dates[i],i,decreasing=TRUE)][1]
)
)
,]
# > DF2
# Id Time Location
# 2 754005 13:19:00 IE578
# 4 980278 21:15:00 JB237
Or using data.table
:
library(data.table)
# convert data.frame to data.table
DT <- as.data.table(DF)
# create column with time strings converted to ITime objects
DT[,TimeConverted := as.ITime(Time,format='%H:%M:%S')]
# group by Id and choose the right row
DT2 <- DT[,{.SD[order(TimeConverted,.I,decreasing=TRUE)[1]]},by=Id]
# remove TimeConverted column
DT2[,TimeConverted := NULL]
# > DT2
# Id Time Location
# 1: 754005 13:19:00 IE578
# 2: 980278 21:15:00 JB237
Upvotes: 0
Reputation: 388982
We can group_by
Id
, and arrange
Time
in desc
ending order and select the last row from each group using slice
library(dplyr)
library(lubridate)
df %>%
group_by(Id) %>%
arrange(desc(hms(Time))) %>%
slice(n())
# Id Time Location
# <int> <fct> <fct>
#1 754005 13:19:00 IE578
#2 980278 21:15:00 JB237
Upvotes: 0