fifigoblin
fifigoblin

Reputation: 455

Conditional filtering in groups using data.table in R

I have two datasets, which I want to first join and then filter based on conditions, for each group (id, code). For each group, if the end_date is earlier than date, I want to simply filter for the latest end_date. Otherwise, I want to filter for the rows where date is between two columns, start_date and end_date.

I have coded this using dplyr and it works - see below.

left_join(df, df_match, by='id') %>% 
group_by(id, code) %>%
mutate(is.max = max(end_date) < date) %>%
filter(case_when(
  is.max == T ~ end_date == max(end_date),
  is.max == F ~ date >= start_date & date <= end_date
))

However, this code is very slow for my 1+ million row datasets. I am curious if it's possible to achieve the same thing using data.table, which is usually much faster?

Upvotes: 0

Views: 627

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388817

Can't test this without data but data.table translation of dplyr code would be -

library(data.table)

setDT(df)
setDT(df_match)

res <- merge(df, df_match, by = 'id')

res[, .SD[if(max(end_date) < date) end_date == max(end_date) else 
  date >= start_date & date <= end_date],  .(id, code)]

Upvotes: 2

Related Questions