Jin
Jin

Reputation: 555

Conditionally remove rows based on date and time

I am trying to implement a way to filter this dataframe df

structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey", 
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20", 
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45", 
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15", 
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04", 
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")

I want to filter it such that if there are duplicate responses in the Name column like how Jim appears twice I would like to keep the row that has the earliest date and time according to the EndDate column ONLY if the Progress column value is greater than 70. Otherwise I want to take the row that has a later date and time in the EndDate column.

Upvotes: 2

Views: 1667

Answers (3)

akrun
akrun

Reputation: 887501

Based on the condition, we convert the 'EndDate' to DateTime class, then arrange by 'Name', 'EndDate', grouped by 'Name' if the first element of 'Progres' is greater than 70 return index 1 or else the last row index in slice to subset the rows

library(tidyverse)
library(lubridate)
df %>%  
   mutate(EndDate = mdy_hm(EndDate)) %>%
   # if there are multiple formats
   # mutate(EndDate = anytime::anytime(EndDate)) %>%
   arrange(Name, EndDate) %>% 
   group_by(Name) %>%
   slice(if(first(Progress) > 70) 1 else n())
# A tibble: 7 x 3
# Groups:   Name [7]
#  Name   Progress EndDate            
#  <chr>  <chr>    <dttm>             
#1 Jane   70       2018-11-30 11:18:00
#2 Jim    25       2018-11-29 18:04:00
#3 Jose   80       2018-11-25 14:20:00
#4 Matt   20       2018-12-01 22:52:00
#5 Mickey 65       2018-11-29 18:15:00
#6 Peter  20       2018-11-26 12:07:00
#7 Tom    45       2018-12-02 15:27:00

NOTE: if there are multiple 'DateTime' formats, one option is anytime::anytime instead of mdy_hm

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 389135

Using dplyr, we first convert EndDate to date time object using parse_date_time from lubridate then we group_by Name and select row with minimum EndDate if Progress > 70 and number of rows for each Name is more than 1 and maximum EndDate otherwise. If there is only one row for the Name then we select only that one by default.

library(dplyr)
library(lubridate)

df %>%
  mutate(EndDate = parse_date_time(EndDate,c("%m-%d-%y %H:%M","%Y-%m-%d %H:%M:%S"))) %>%
  group_by(Name) %>%
  slice(ifelse(n() > 1, 
        ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))


#  Name   Progress EndDate            
#  <chr>  <chr>    <dttm>             
#1 Jane   70       2018-11-30 11:18:00
#2 Jim    25       2018-11-29 18:04:00
#3 Jose   80       2018-11-25 14:20:00
#4 Matt   20       2018-12-01 22:52:00
#5 Mickey 65       2018-11-29 18:15:00
#6 Peter  20       2018-11-26 12:07:00
#7 Tom    45       2018-12-02 15:27:00

Upvotes: 3

Wimpel
Wimpel

Reputation: 27762

An (of course) this can also be done using data.table

sample data

df <- structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey", 
                        "Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20", 
                                                                             "80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45", 
                                                                                                                                          "11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15", 
                                                                                                                                          "12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04", 
                                                                                                                                          "11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")

code

#create the data.table (can also be done using setDT(df) )
dt <- as.data.table( df )
#set the dates to a proper POSIXct-format
dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
#order omn EndDate (by reference!)
setorder( dt, EndDate )
#summarise by Name, if first Progress >70 then keep it, else keep last Progress
dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]

benchmarks

microbenchmark::microbenchmark(
  data.table = {
    dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
    setorder( dt, EndDate )
    dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
  },
  tidyverse1 = {
    df %>%  
      mutate(EndDate = mdy_hm(EndDate)) %>%
      arrange(Name, EndDate) %>% 
      group_by(Name) %>%
      slice(if(first(Progress) > 70) 1 else n())
  },
  tidyverse2 = {
    df %>%
      mutate(EndDate = mdy_hm(EndDate)) %>%
      group_by(Name) %>%
      slice(ifelse(n() > 1, 
                   ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))
  }
)


# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval
# data.table 1.654241 2.030820 2.709023 2.556978 2.782023 30.36590   100
# tidyverse1 6.847731 7.218286 8.742247 7.516838 8.034861 72.00902   100
# tidyverse2 6.173201 6.506398 7.286639 6.764582 7.088591 52.10180   100

Upvotes: 1

Related Questions