Reputation: 5673

efficient way of selecting rows with a minimum time spacing between dates while grouping

I want to select rows of data with dates such that the dates have a minimum time difference of 3 months. Here is an example:

    patient numsermed       date
 1:       1   numser1 2020-01-08
 2:       2   numser2 2015-01-02
 3:       2   numser2 2019-12-12
 4:       2   numser2 2020-01-05
 5:       2   numser2 2020-01-08
 6:       2   numser2 2020-01-20
 7:       2   numser2 2020-03-15
 8:       2   numser2 2020-03-18
 9:       2   numser3 2020-03-13
10:       2   numser3 2020-03-18
11:       3   numser3 2020-01-22
12:       4   numser4 2018-01-02

I want, by patient and numsermed, keep the date that have at least 3 months difference. I cannot use simply the successive differences. Expected result is:

   patient numsermed       date
1:       1   numser1 2020-01-08
2:       2   numser2 2015-01-02
3:       2   numser2 2019-12-12
4:       2   numser2 2020-03-15
5:       2   numser3 2020-03-13
6:       3   numser3 2020-01-22
7:       4   numser4 2018-01-02

Here, for numsermed2 and patient 2, after 2019-12-12, the next date 3 months a least later is 2020-03-15, that I keep. I thus remove 2020-01-05, 2020-01-08, 2020-01-20.

I then remove 2020-03-18, which is 3 days after 2020-03-15. Here is my solution with data.table:

library(data.table)
library(lubridate)

setkeyv(test,c("numsermed","patient","date"))
test[,indx := .I - .I[1]+1,by = .(numsermed,patient)]

max(test[,.N,by = .(numsermed,patient)]$N)
Nmax <- max(test[,.N,by = .(numsermed,patient)]$N)
test[,supp := 0]

for(i in 1:Nmax){
  test[N>1 ,supp := ifelse(i < indx & date < date[i] + 90,1,0),
       by = .(numsermed,patient)]
  test <- test2[supp != 1  ]
  test[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
}

The idea is for each row, to test the condition and then perform the subset. It seems to work, but on a million row table, it is rather slow (few hours). I am sure there is an efficient way with semi equi join or rolling join in data.table, but I did not manage to write it. Could someone come up with a more efficient solution ? dplyr solutions are of course welcome too.

The data:

library(data.table)
library(lubridate)  test<-setDT(list(patient=c(1:3,2),numsermed=c(paste0("numser",1:3),"numser2"),date=as_date(c("2020-01-08","2020-01-20","2020-01-22","2019-12-12"))))
    test<-rbind(test,data.table(patient=4,numsermed="numser4",date=as_date("2018-01-02")))
    test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2015-01-02")))
    test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-15")))
    test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-05")))
    test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-08")))
    test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-13")))
    test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-18")))
    test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-18")))

Edit

I propose I comparison of the solution proposed, @Ben 's solution, @chinsoon12 's and @astrofunkswag 's .

Here is the test data:

library(data.table)
library(lubridate)
library(magrittr)

set.seed(1234)
origin <- "1970-01-01"
dt <- data.table(numsermed = sample(paste0("numsermed",1:30),10000,replace = T))
dt[,patient := sample(1:10000,.N,replace = T),by = numsermed]
dt[,date := sample((dmy("01.01.2019") %>% as.numeric()):(dmy("01.01.2020") %>% as.numeric()),.N),by = .(patient)]

and here the 4 functions, including mine:

ben = function(dt){
  dt[, c("idx", "date2") := list(.I, date - 90L)]
  dt_final <- unique(dt[dt, on = c(patient = "patient", numsermed = "numsermed", date = "date2"), 
                            roll = -Inf][order(i.date)], by = "idx")
  setorderv(dt_final, c("patient", "numsermed", "i.date"))
  return(dt_final[,.(patient,numsermed,date = i.date)])
}


chinson = function(dt){
  dt[, d := as.integer(date)]
  setkey(dt,date)
  return( dt[dt[, g := findInterval(d, seq(d[1L], d[.N]+90L, by=90L)), .(patient, numsermed)][,
        .I[1L], .(patient, numsermed, g)]$V1][,.(patient,numsermed,date)])
}

sum_reset_at <- function(thresh) {
  function(x) {
    accumulate(x, ~if_else(.x>=thresh, .y, .x+.y))
  }  
}

mon_diff <- function(d1, d2){
  12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}
library(tidyverse); library(zoo)

astrofun = function(dt){
 return(
    dt %>% 
     group_by(patient, numsermed) %>% 
     mutate(diff1 = mon_diff(date, lag(date)),
            diff1 = if_else(is.na(diff1), 300, diff1)) %>% 
     mutate(diff2 = sum_reset_at(3)(diff1)) %>% 
     filter(diff2 >= 3) %>% 
     select(-contains('diff'))
 ) 
}

denis = function(dt){
  df <- copy(dt)
  setkeyv(dt,c("numsermed","patient","date"))
  df[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
  
  df[,N := .N,by = .(numsermed,patient)]
  Nmax <- max(df[,N])
  df[,supp := 0]
  
  for(i in 1:Nmax){
    df[N>1 ,supp := ifelse(i < indx & date < date[i] + 90,1,0),
         by = .(numsermed,patient)]
    df <- df[supp != 1  ]
    df[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
  }
  return(df[,.(patient,numsermed,date)])
}

First, none of them produce the same result! denis(dt) output 9833 lines, ben(dt) 9928, chinson(dt) 9929, and @astrofunkswag solution astrofun(dt) output 9990 lines. I am not sur why this does not produce the same output, nor what solution is the good one (I would say mine just to be pretentious, but I am not even sure).

Then a benchmarking to compare efficiency.

library(microbenchmark)
microbenchmark(ben(dt),
               chinson(dt),
               astrofun(dt),
               denis(dt),times = 10)


Unit: milliseconds
         expr       min        lq       mean    median        uq       max neval
      ben(dt)   17.3841   19.8321   20.88349   20.9609   21.8815   23.5125    10
  chinson(dt)  230.8868  232.6298  275.16637  236.8482  239.0144  544.2292    10
 astrofun(dt) 4460.2159 4565.9120 4795.98600 4631.3251 5007.8055 5687.7717    10
    denis(dt)   68.0480   68.4170   88.88490   80.9636   90.0514  142.9553    10

@Ben 's solution with rolling join is the fastest of course. Mine is not that bad, and @astrofunkswag 's solution is super slow because of the cumulative sum I guess.

Upvotes: 3

Answers (3)

chinsoon12

Reputation: 25225

Another option using findInterval to group:

library(data.table)
DT[, d := as.integer(date)]
DT[DT[, g := findInterval(d, seq(d[1L], d[.N]+90L, by=90L)), .(patient, numsermed)][,
    .I[1L], .(patient, numsermed, g)]$V1]

output:

   patient numsermed       date     d  g
1:       1   numser1 2020-01-08 18269  1
2:       2   numser2 2015-01-02 16437  1
3:       2   numser2 2019-12-12 18242 21
4:       2   numser2 2020-03-15 18336 22
5:       2   numser3 2020-03-13 18334  1
6:       3   numser3 2020-01-22 18283  1
7:       4   numser4 2018-01-02 17533  1

If you have many groups of patient and numsermed, Ben's solution using rolling join will be faster. And another way of coding the rolling join by chaining:

DT[, .(patient, numsermed, date=date+90L)][
    DT, on=.NATURAL, roll=-Inf, .(patient, numsermed, x.date, i.date)][, 
        .(date=i.date[1L]), .(patient, numsermed, x.date)][, 
            x.date := NULL][]

Or more succinctly:

DT[, c("rn", "qtrago") := .(.I, date - 90L)]
DT[DT[DT, on=.(patient, numsermed, date=qtrago), roll=-Inf, unique(rn)]]

data:

library(data.table)
DT <- fread("patient numsermed       date
1   numser1 2020-01-08
2   numser2 2015-01-02
2   numser2 2019-12-12
2   numser2 2020-01-05
2   numser2 2020-01-08
2   numser2 2020-01-20
2   numser2 2020-03-15
2   numser2 2020-03-18
2   numser3 2020-03-13
2   numser3 2020-03-18
3   numser3 2020-01-22
4   numser4 2018-01-02")
DT[, date := as.IDate(date, format="%Y-%m-%d")]

Upvotes: 1

Ben

Reputation: 30474

With data.table you could try the following. This would involve creating a second date 90 days prior and then doing a rolling join.

library(data.table)

setDT(test[, c("idx", "date2") := list(.I, date - 90L)]) 
test_final <- unique(test[test, on = c(patient = "patient", numsermed = "numsermed", date = "date2"), 
                          roll = -Inf][order(i.date)], by = "idx")
setorderv(test_final, c("patient", "numsermed", "i.date"))
test_final

Output

(i.date has the final date desired)

   patient numsermed       date idx      date2     i.date i.idx
1:       1   numser1 2019-10-10   1 2019-10-10 2020-01-08     1
2:       2   numser2 2014-10-04   6 2014-10-04 2015-01-02     6
3:       2   numser2 2019-09-13   4 2019-09-13 2019-12-12     4
4:       2   numser2 2019-12-16   8 2019-10-07 2020-03-15     7
5:       2   numser3 2019-12-14  10 2019-12-14 2020-03-13    10
6:       3   numser3 2019-10-24   3 2019-10-24 2020-01-22     3
7:       4   numser4 2017-10-04   5 2017-10-04 2018-01-02     5

Upvotes: 2

astrofunkswag

Reputation: 2698

Here is a solution with dplyr and purrr. I use 2 helper functions, one to calculate month difference and one to calculate a cumulative sum that resets when a threshold is reached, credit to this post.

I calculate the month difference with the lagging date value, but you want to include the first one which will be NA. One weird part is that to include NA the easiest for me was to convert NA to some value 3 or greater. I arbitrarily made it 300. You could likely modify the sum_reset_at function to handle NA the way you want. You might also want to condense the code in some way since I do multiple mutate calls and then deselect those column, but I did it all in seperate lines to make it more clear what was happening. I think this functional programming solution will be quicker, but I haven't tested it on a large dataset compared to your current solution.

test <- test %>% arrange(patient, numsermed, date)


library(tidyverse); library(zoo)

mon_diff <- function(d1, d2){
  12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}

sum_reset_at <- function(thresh) {
  function(x) {
    accumulate(x, ~if_else(.x>=thresh, .y, .x+.y))
  }  
}

test %>% 
  group_by(patient, numsermed) %>% 
  mutate(diff1 = mon_diff(date, lag(date)),
         diff1 = if_else(is.na(diff1), 300, diff1)) %>% 
  mutate(diff2 = sum_reset_at(3)(diff1)) %>% 
  filter(diff2 >= 3) %>% 
  select(-contains('diff'))


test
    <dbl> <chr>     <date>    
1       1 numser1   2020-01-08
2       2 numser2   2015-01-02
3       2 numser2   2019-12-12
4       2 numser2   2020-03-15
5       2 numser3   2020-03-13
6       3 numser3   2020-01-22
7       4 numser4   2018-01-02

Upvotes: 1

efficient way of selecting rows with a minimum time spacing between dates while grouping

Edit

Answers (3)

Related Questions