Reputation: 5673
I want to select rows of data with dates such that the dates have a minimum time difference of 3 months. Here is an example:
patient numsermed date
1: 1 numser1 2020-01-08
2: 2 numser2 2015-01-02
3: 2 numser2 2019-12-12
4: 2 numser2 2020-01-05
5: 2 numser2 2020-01-08
6: 2 numser2 2020-01-20
7: 2 numser2 2020-03-15
8: 2 numser2 2020-03-18
9: 2 numser3 2020-03-13
10: 2 numser3 2020-03-18
11: 3 numser3 2020-01-22
12: 4 numser4 2018-01-02
I want, by patient
and numsermed
, keep the date
that have at least 3 months difference. I cannot use simply the successive differences. Expected result is:
patient numsermed date
1: 1 numser1 2020-01-08
2: 2 numser2 2015-01-02
3: 2 numser2 2019-12-12
4: 2 numser2 2020-03-15
5: 2 numser3 2020-03-13
6: 3 numser3 2020-01-22
7: 4 numser4 2018-01-02
Here, for numsermed2
and patient 2, after 2019-12-12
, the next date 3 months a least later is 2020-03-15
, that I keep. I thus remove 2020-01-05
, 2020-01-08
, 2020-01-20
.
I then remove 2020-03-18
, which is 3 days after 2020-03-15
.
Here is my solution with data.table
:
library(data.table)
library(lubridate)
setkeyv(test,c("numsermed","patient","date"))
test[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
max(test[,.N,by = .(numsermed,patient)]$N)
Nmax <- max(test[,.N,by = .(numsermed,patient)]$N)
test[,supp := 0]
for(i in 1:Nmax){
test[N>1 ,supp := ifelse(i < indx & date < date[i] + 90,1,0),
by = .(numsermed,patient)]
test <- test2[supp != 1 ]
test[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
}
The idea is for each row, to test the condition and then perform the subset. It seems to work, but on a million row table, it is rather slow (few hours). I am sure there is an efficient way with semi equi join or rolling join in data.table
, but I did not manage to write it. Could someone come up with a more efficient solution ? dplyr
solutions are of course welcome too.
The data:
library(data.table)
library(lubridate) test<-setDT(list(patient=c(1:3,2),numsermed=c(paste0("numser",1:3),"numser2"),date=as_date(c("2020-01-08","2020-01-20","2020-01-22","2019-12-12"))))
test<-rbind(test,data.table(patient=4,numsermed="numser4",date=as_date("2018-01-02")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2015-01-02")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-15")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-05")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-08")))
test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-13")))
test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-18")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-18")))
I propose I comparison of the solution proposed, @Ben 's solution, @chinsoon12 's and @astrofunkswag 's .
Here is the test data:
library(data.table)
library(lubridate)
library(magrittr)
set.seed(1234)
origin <- "1970-01-01"
dt <- data.table(numsermed = sample(paste0("numsermed",1:30),10000,replace = T))
dt[,patient := sample(1:10000,.N,replace = T),by = numsermed]
dt[,date := sample((dmy("01.01.2019") %>% as.numeric()):(dmy("01.01.2020") %>% as.numeric()),.N),by = .(patient)]
and here the 4 functions, including mine:
ben = function(dt){
dt[, c("idx", "date2") := list(.I, date - 90L)]
dt_final <- unique(dt[dt, on = c(patient = "patient", numsermed = "numsermed", date = "date2"),
roll = -Inf][order(i.date)], by = "idx")
setorderv(dt_final, c("patient", "numsermed", "i.date"))
return(dt_final[,.(patient,numsermed,date = i.date)])
}
chinson = function(dt){
dt[, d := as.integer(date)]
setkey(dt,date)
return( dt[dt[, g := findInterval(d, seq(d[1L], d[.N]+90L, by=90L)), .(patient, numsermed)][,
.I[1L], .(patient, numsermed, g)]$V1][,.(patient,numsermed,date)])
}
sum_reset_at <- function(thresh) {
function(x) {
accumulate(x, ~if_else(.x>=thresh, .y, .x+.y))
}
}
mon_diff <- function(d1, d2){
12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}
library(tidyverse); library(zoo)
astrofun = function(dt){
return(
dt %>%
group_by(patient, numsermed) %>%
mutate(diff1 = mon_diff(date, lag(date)),
diff1 = if_else(is.na(diff1), 300, diff1)) %>%
mutate(diff2 = sum_reset_at(3)(diff1)) %>%
filter(diff2 >= 3) %>%
select(-contains('diff'))
)
}
denis = function(dt){
df <- copy(dt)
setkeyv(dt,c("numsermed","patient","date"))
df[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
df[,N := .N,by = .(numsermed,patient)]
Nmax <- max(df[,N])
df[,supp := 0]
for(i in 1:Nmax){
df[N>1 ,supp := ifelse(i < indx & date < date[i] + 90,1,0),
by = .(numsermed,patient)]
df <- df[supp != 1 ]
df[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
}
return(df[,.(patient,numsermed,date)])
}
First, none of them produce the same result! denis(dt)
output 9833 lines, ben(dt)
9928, chinson(dt)
9929, and @astrofunkswag solution astrofun(dt)
output 9990 lines. I am not sur why this does not produce the same output, nor what solution is the good one (I would say mine just to be pretentious, but I am not even sure).
Then a benchmarking to compare efficiency.
library(microbenchmark)
microbenchmark(ben(dt),
chinson(dt),
astrofun(dt),
denis(dt),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
ben(dt) 17.3841 19.8321 20.88349 20.9609 21.8815 23.5125 10
chinson(dt) 230.8868 232.6298 275.16637 236.8482 239.0144 544.2292 10
astrofun(dt) 4460.2159 4565.9120 4795.98600 4631.3251 5007.8055 5687.7717 10
denis(dt) 68.0480 68.4170 88.88490 80.9636 90.0514 142.9553 10
@Ben 's solution with rolling join is the fastest of course. Mine is not that bad, and @astrofunkswag 's solution is super slow because of the cumulative sum I guess.
Upvotes: 3
Views: 115
Reputation: 25225
Another option using findInterval
to group:
library(data.table)
DT[, d := as.integer(date)]
DT[DT[, g := findInterval(d, seq(d[1L], d[.N]+90L, by=90L)), .(patient, numsermed)][,
.I[1L], .(patient, numsermed, g)]$V1]
output:
patient numsermed date d g
1: 1 numser1 2020-01-08 18269 1
2: 2 numser2 2015-01-02 16437 1
3: 2 numser2 2019-12-12 18242 21
4: 2 numser2 2020-03-15 18336 22
5: 2 numser3 2020-03-13 18334 1
6: 3 numser3 2020-01-22 18283 1
7: 4 numser4 2018-01-02 17533 1
If you have many groups of patient and numsermed, Ben's solution using rolling join will be faster. And another way of coding the rolling join by chaining:
DT[, .(patient, numsermed, date=date+90L)][
DT, on=.NATURAL, roll=-Inf, .(patient, numsermed, x.date, i.date)][,
.(date=i.date[1L]), .(patient, numsermed, x.date)][,
x.date := NULL][]
Or more succinctly:
DT[, c("rn", "qtrago") := .(.I, date - 90L)]
DT[DT[DT, on=.(patient, numsermed, date=qtrago), roll=-Inf, unique(rn)]]
data:
library(data.table)
DT <- fread("patient numsermed date
1 numser1 2020-01-08
2 numser2 2015-01-02
2 numser2 2019-12-12
2 numser2 2020-01-05
2 numser2 2020-01-08
2 numser2 2020-01-20
2 numser2 2020-03-15
2 numser2 2020-03-18
2 numser3 2020-03-13
2 numser3 2020-03-18
3 numser3 2020-01-22
4 numser4 2018-01-02")
DT[, date := as.IDate(date, format="%Y-%m-%d")]
Upvotes: 1
Reputation: 30474
With data.table
you could try the following. This would involve creating a second date 90 days prior and then doing a rolling join.
library(data.table)
setDT(test[, c("idx", "date2") := list(.I, date - 90L)])
test_final <- unique(test[test, on = c(patient = "patient", numsermed = "numsermed", date = "date2"),
roll = -Inf][order(i.date)], by = "idx")
setorderv(test_final, c("patient", "numsermed", "i.date"))
test_final
Output
(i.date
has the final date desired)
patient numsermed date idx date2 i.date i.idx
1: 1 numser1 2019-10-10 1 2019-10-10 2020-01-08 1
2: 2 numser2 2014-10-04 6 2014-10-04 2015-01-02 6
3: 2 numser2 2019-09-13 4 2019-09-13 2019-12-12 4
4: 2 numser2 2019-12-16 8 2019-10-07 2020-03-15 7
5: 2 numser3 2019-12-14 10 2019-12-14 2020-03-13 10
6: 3 numser3 2019-10-24 3 2019-10-24 2020-01-22 3
7: 4 numser4 2017-10-04 5 2017-10-04 2018-01-02 5
Upvotes: 2
Reputation: 2698
Here is a solution with dplyr
and purrr
. I use 2 helper functions, one to calculate month difference and one to calculate a cumulative sum that resets when a threshold is reached, credit to this post.
I calculate the month difference with the lagging date value, but you want to include the first one which will be NA
. One weird part is that to include NA
the easiest for me was to convert NA
to some value 3 or greater. I arbitrarily made it 300. You could likely modify the sum_reset_at
function to handle NA
the way you want. You might also want to condense the code in some way since I do multiple mutate
calls and then deselect those column, but I did it all in seperate lines to make it more clear what was happening. I think this functional programming solution will be quicker, but I haven't tested it on a large dataset compared to your current solution.
test <- test %>% arrange(patient, numsermed, date)
library(tidyverse); library(zoo)
mon_diff <- function(d1, d2){
12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}
sum_reset_at <- function(thresh) {
function(x) {
accumulate(x, ~if_else(.x>=thresh, .y, .x+.y))
}
}
test %>%
group_by(patient, numsermed) %>%
mutate(diff1 = mon_diff(date, lag(date)),
diff1 = if_else(is.na(diff1), 300, diff1)) %>%
mutate(diff2 = sum_reset_at(3)(diff1)) %>%
filter(diff2 >= 3) %>%
select(-contains('diff'))
test
<dbl> <chr> <date>
1 1 numser1 2020-01-08
2 2 numser2 2015-01-02
3 2 numser2 2019-12-12
4 2 numser2 2020-03-15
5 2 numser3 2020-03-13
6 3 numser3 2020-01-22
7 4 numser4 2018-01-02
Upvotes: 1