Reputation: 2166
I have an R data.table that is structured as follows:
> str(dat)
Classes ‘data.table’ and 'data.frame': 26802896 obs. of 4 variables:
$ id : chr ...
$ date1 : Date, format: "2011-12-15" "2012-11-02" ...
$ date2: Date, format: "2010-08-15" "2011-01-04" ...
$ row_name : chr ...
My goal is to create a new variable matching_row_name
with the row name of a row for which date1-date2>0 days & date1-date2<30 days
stratified by id
. In the case of multiple matches, I would like use the row_name
of the min(date1-date2)
for the matching variables. There are no ties.
I have made a reproducible sample data.table
as follows:
latemail <- function(N, st="2012/01/01", et="2013/12/31") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et,st,unit="sec"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
set.seed(1)
date1=latemail(1000, st="2012/01/01", et="2013/12/31")
set.seed(2)
date2=latemail(1000, st="2012/01/02", et="2013/12/31")
set.seed(3)
ids=sample(letters[1:10],100,replace=TRUE)
dat=data.table(date1=date1,date2=date2,id=ids,row_name=seq(1:1000))
dat=dat[date1<date2]
It looks like this
> dat
date1 date2 id row_name
1: 2012-01-01 18:01:58 2012-01-02 06:36:13 b 1
2: 2012-01-02 03:10:54 2012-01-03 14:57:18 i 2
3: 2012-01-02 04:51:47 2012-01-04 03:47:44 d 3
4: 2012-01-06 17:24:37 2012-01-06 23:12:37 g 5
5: 2012-01-08 22:20:21 2012-01-09 09:12:45 f 9
I have tried things like the following:
test_function=function(date1="date1",date2="date2"){return(which(as.numeric((date1-date2))==as.numeric(min(date1-date2))))}
dat=dat[,test:=lapply(.SD,test_function), by =id, .SDcols = c("date1","date2")]
to no avail.
The ideal output would be something like this (note i made up the value for row name 2 in this example):
date1 date2 id row_name matching_row_name
1: 2012-01-01 18:01:58 2012-01-02 06:36:13 b 1 32
Or if there is no second date in the range, then
date1 date2 id row_name matching_row_name
1: 2012-01-01 18:01:58 2012-01-02 06:36:13 b 1 NA
Upvotes: 1
Views: 114
Reputation: 4643
I made several assumptions as some details were not clear from the questions. As there are no positive date1 - date2
values, I took date2 - date1
. Also I took minutes as time units. Then my answer is:
f <- function(date1, date2) {
dd <- as.numeric(difftime(date2, date1, units = 'days'))
id <- which(dd > 0 & dd < 30)
n <- length(id)
if(n >= 1) which.min(dd)
else if (n < 1) NA_integer_
}
dat[, matching_row_name := row_name[f(date1, date2)], by = id]
In case of additional clarifications I will update the answer.
Upvotes: 2