Reputation: 455
I have two datasets which I want to first join and then filter based on conditions, for each group (id, code). For each group, if the latest end_date is earlier than date, I want to simply keep the row with the latest end_date. Otherwise, I want to keep the rows where date falls between two columns, start_date and end_date.
I have coded this using dplyr and it works - see below.
library(dplyr)

left_join(df, df_match, by = 'id') %>%
  group_by(id, code) %>%
  mutate(is.max = max(end_date) < date) %>%
  filter(case_when(
    is.max == TRUE  ~ end_date == max(end_date),
    is.max == FALSE ~ date >= start_date & date <= end_date
  ))
However, this code is very slow for my 1+ million row datasets. I am curious if it's possible to achieve the same thing using data.table, which is usually much faster?
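For illustration, a tiny made-up example along these lines (column names as in the code above; here I'm putting code and date in df and the date ranges in df_match, with invented values) shows both cases:

df <- data.frame(
  id   = c(1, 2),
  code = c("A", "B"),
  date = as.Date(c("2021-06-01", "2021-06-01"))
)

df_match <- data.frame(
  id         = c(1, 1, 2, 2),
  start_date = as.Date(c("2021-01-01", "2021-05-01", "2020-01-01", "2020-03-01")),
  end_date   = as.Date(c("2021-04-30", "2021-12-31", "2020-02-28", "2020-06-30"))
)

Here the id = 1 group still has an end_date on or after date, so the between-dates branch applies, while for id = 2 every end_date is in the past, so only the latest end_date row should be kept.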
Upvotes: 0
Views: 627
Reputation: 388817
Can't test this without data but a data.table translation of the dplyr code would be -
library(data.table)
setDT(df)
setDT(df_match)

res <- merge(df, df_match, by = 'id')

# if the group's latest end_date is before date (date[1L]: assumed constant
# per group), keep the latest-end_date rows; else keep rows within the range
res[, .SD[if (max(end_date) < date[1L]) end_date == max(end_date)
          else date >= start_date & date <= end_date],
    by = .(id, code)]
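If the per-group .SD subset is still too slow, one possible variant (an untested sketch; max_end and is_max are helper columns introduced only here) is to compute the group maximum with := and then filter the whole table in a single vectorised pass, mirroring the row-wise case_when from the question -

library(data.table)
setDT(df)
setDT(df_match)

res <- merge(df, df_match, by = 'id')

# helper columns: the group's latest end_date, and the row-wise flag
res[, max_end := max(end_date), by = .(id, code)]
res[, is_max := max_end < date]

# keep the latest-end_date rows where the flag is TRUE, otherwise the rows
# where date falls between start_date and end_date
out <- res[(is_max & end_date == max_end) |
           (!is_max & date >= start_date & date <= end_date)]
out[, c("max_end", "is_max") := NULL]

Grouped .SD subsetting calls [.data.table once per group, which tends to be the bottleneck with many groups; whether this version is actually faster would need benchmarking on the real data.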
Upvotes: 2