Emily Fassbender
Emily Fassbender

Reputation: 423

Subset dataset by time criteria by ID

I am dealing with a dataset which looks like this below

 Id           Time           Location
 754005       13:19:00       HK564
 754005       13:19:00       IE578
 980278       19:51:00       AK177
 980278       21:15:00       JB237

The Id column contains duplicate values, Time may or may not contain duplicates, and the Location column contains unique values.

I am trying to filter this dataset where the Id values are unique based on this logic:

1) If the time values are the same for that Id, then retain the last row for that Id. In other words, id 754005 is repeated twice, the times are the same but Location is different so retaining the last row means retaining this row

754005       13:19:00       IE578

2) If the time values are different for that Id, then retain the row with the latest time value for that Id. In other words, id 980278 has two different time values, 19:51:00, 21:15:00, retain the row with time 21:15:00 for this id because this is the latest time.

The final dataset should looks like this below

 Id           Time           Location
 754005       13:19:00       IE578
 980278       21:15:00       JB237

Upvotes: 1

Views: 69

Answers (3)

Uwe
Uwe

Reputation: 42544

This can be achieved by ordering the dataset by Time and by picking the last observation in each group. With data.table this becomes "one-liner":

library(data.table)
setDT(DF)[order(Time), .SD[.N], by = Id]
       Id     Time Location
1: 754005 13:19:00    IE578
2: 980278 21:15:00    JB237

Alternative, the tail() function can be used to pick the last observation in each group:

setDT(DF)[order(Time), tail(.SD, 1), by = Id]

Explanation

This solutions needs less lines of code than the other answers posted so far, especially the data.table code by digEmAll. So, I feel this deserves a detailed explanation:

  • setDT(DF) coerces DF into a data.table object by reference, i.e., without copying while as.data.table(DF) creates a copy.
  • order(Time) orders Time lexicographically. There is no need for converting the time string into something else for the purpose of sorting. Here, Time is a factor whose levels were ordered lexicographically while reading the dataset.
    According to the help page help("order"), any unresolved ties will be left in their original ordering. So, the order of rows for Id == 754005, e.g., will not be changed.
  • .SD is the subset of rows for each group. .N is the number of rows in each group. So, .SD[.N] picks the last row of each group.

Data

library(data.table)
DF <- fread("
 Id           Time           Location
 754005       13:19:00       HK564
 754005       13:19:00       IE578
 980278       19:51:00       AK177
 980278       21:15:00       JB237", 
data.table = FALSE, stringsAsFactors = TRUE)

DF is a data.frame with character columns turned into factors assuming the "worst case".

str(DF)
'data.frame': 4 obs. of  3 variables:
 $ Id      : int  754005 754005 980278 980278
 $ Time    : Factor w/ 3 levels "13:19:00","19:51:00",..: 1 1 2 3
 $ Location: Factor w/ 4 levels "AK177","HK564",..: 2 3 1 4

Upvotes: 1

digEmAll
digEmAll

Reputation: 57210

A possible approach using base R :

# recreate your input
DF <- read.table(text=
'Id           Time           Location
754005       13:19:00       HK564
754005       13:19:00       IE578
980278       19:51:00       AK177
980278       21:15:00       JB237',header=TRUE)

# convert time strings to datetimes
dates <- strptime(DF$Time,format='%H:%M:%S',tz='GMT')

# get the desired rows
DF2 <- 
DF[unique(
   ave(seq_len(nrow(DF)),DF$Id,
       FUN=function(i)i[order(dates[i],i,decreasing=TRUE)][1]
      )
   )
,]

# > DF2
#       Id     Time Location
# 2 754005 13:19:00    IE578
# 4 980278 21:15:00    JB237

Or using data.table :

library(data.table)
# convert data.frame to data.table
DT <- as.data.table(DF)

# create column with time strings converted to ITime objects
DT[,TimeConverted := as.ITime(Time,format='%H:%M:%S')]

# group by Id and choose the right row
DT2 <- DT[,{.SD[order(TimeConverted,.I,decreasing=TRUE)[1]]},by=Id]

# remove TimeConverted column
DT2[,TimeConverted := NULL]

# > DT2
#        Id     Time Location
# 1: 754005 13:19:00    IE578
# 2: 980278 21:15:00    JB237

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 388982

We can group_by Id, and arrange Time in descending order and select the last row from each group using slice

library(dplyr)
library(lubridate)

df %>%
  group_by(Id) %>%
  arrange(desc(hms(Time))) %>%
  slice(n())

#     Id    Time   Location
#   <int> <fct>    <fct>   
#1 754005 13:19:00 IE578   
#2 980278 21:15:00 JB237  

Upvotes: 0

Related Questions