Reputation: 555
I am trying to implement a way to filter this dataframe df
structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey",
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20",
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45",
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15",
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")
I want to filter it such that if there are duplicate responses in the Name
column like how Jim appears twice I would like to keep the row that has the earliest date and time according to the EndDate
column ONLY if the Progress
column value is greater than 70. Otherwise I want to take the row that has a later date and time in the EndDate
column.
Upvotes: 2
Views: 1667
Reputation: 887501
Based on the condition, we convert the 'EndDate' to DateTime
class, then arrange
by 'Name', 'EndDate', grouped by 'Name' if
the first
element of 'Progres' is greater than 70 return index 1 or else
the last row index in slice
to subset the rows
library(tidyverse)
library(lubridate)
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
# if there are multiple formats
# mutate(EndDate = anytime::anytime(EndDate)) %>%
arrange(Name, EndDate) %>%
group_by(Name) %>%
slice(if(first(Progress) > 70) 1 else n())
# A tibble: 7 x 3
# Groups: Name [7]
# Name Progress EndDate
# <chr> <chr> <dttm>
#1 Jane 70 2018-11-30 11:18:00
#2 Jim 25 2018-11-29 18:04:00
#3 Jose 80 2018-11-25 14:20:00
#4 Matt 20 2018-12-01 22:52:00
#5 Mickey 65 2018-11-29 18:15:00
#6 Peter 20 2018-11-26 12:07:00
#7 Tom 45 2018-12-02 15:27:00
NOTE: if there are multiple 'DateTime' formats, one option is anytime::anytime
instead of mdy_hm
Upvotes: 3
Reputation: 389135
Using dplyr
, we first convert EndDate
to date time object using parse_date_time
from lubridate
then we group_by
Name
and select row with minimum EndDate
if Progress > 70
and number of rows for each Name
is more than 1 and maximum EndDate
otherwise. If there is only one row for the Name
then we select only that one by default.
library(dplyr)
library(lubridate)
df %>%
mutate(EndDate = parse_date_time(EndDate,c("%m-%d-%y %H:%M","%Y-%m-%d %H:%M:%S"))) %>%
group_by(Name) %>%
slice(ifelse(n() > 1,
ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))
# Name Progress EndDate
# <chr> <chr> <dttm>
#1 Jane 70 2018-11-30 11:18:00
#2 Jim 25 2018-11-29 18:04:00
#3 Jose 80 2018-11-25 14:20:00
#4 Matt 20 2018-12-01 22:52:00
#5 Mickey 65 2018-11-29 18:15:00
#6 Peter 20 2018-11-26 12:07:00
#7 Tom 45 2018-12-02 15:27:00
Upvotes: 3
Reputation: 27762
An (of course) this can also be done using data.table
sample data
df <- structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey",
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20",
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45",
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15",
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")
code
#create the data.table (can also be done using setDT(df) )
dt <- as.data.table( df )
#set the dates to a proper POSIXct-format
dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
#order omn EndDate (by reference!)
setorder( dt, EndDate )
#summarise by Name, if first Progress >70 then keep it, else keep last Progress
dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
benchmarks
microbenchmark::microbenchmark(
data.table = {
dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
setorder( dt, EndDate )
dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
},
tidyverse1 = {
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
arrange(Name, EndDate) %>%
group_by(Name) %>%
slice(if(first(Progress) > 70) 1 else n())
},
tidyverse2 = {
df %>%
mutate(EndDate = mdy_hm(EndDate)) %>%
group_by(Name) %>%
slice(ifelse(n() > 1,
ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 1.654241 2.030820 2.709023 2.556978 2.782023 30.36590 100
# tidyverse1 6.847731 7.218286 8.742247 7.516838 8.034861 72.00902 100
# tidyverse2 6.173201 6.506398 7.286639 6.764582 7.088591 52.10180 100
Upvotes: 1