Reputation: 6726
I have a data table and a list of data frame formatted as follows:
require(data.table)
members = c('a','b','c')
DT = do.call('rbind',
lapply(members, function(x){
date = seq(as.Date("2015/1/1"), as.Date("2015/12/31"), 'days')
dummy = sample(length(date))
dt = data.table(member=sample(x, length(dummy), replace=TRUE), date=date, dummy=dummy)}
)
)
date = seq(as.Date("2015/1/1"), as.Date("2015/12/31"), 'days')
l.members = lapply(members, function(x){
n.period = sample(10,1)
do.call('rbind',
lapply(1:n.period, function(y){
period = sample(date, 2)
if (period[1]>period[2]){
start=period[2]
due=period[1]
}else{
start=period[1]
due=period[2]
}
return(data.frame(start.date=start, due.date=due))}
)
)
})
names(l.members) = members
DT
is the large (about 4G as csv file) data table from which I want to subset based on the l.members
. Every name of entry x in l.members
is one of the members in unique(DT$member)
. In every entry there is a data frame with each row representing a period [p1, p2], based on which I want to subset the rows in DT
with DT$member
is x and DT$date
=> p1 and DT$date
<= p2. The current workaround is as follows:
l.member.periods = lapply(members,
function(x){
DT.member = DT[member==x]
apply(l.members[[x]], 1,
function(y){
start = y[1]
due = y[2]
return(DT.member[date>=start&date<=due])
}
)
}
)
It takes decades when there are about 5000 entries in l.members
, each of which has mostly 10 rows (periods). I've tried replacing lapply
with mclapply
but doesn't seem to work, it ends up drying the memory and hanging. How can I speed up the process?
Upvotes: 1
Views: 456
Reputation: 7373
You can use foverlaps
.
First, you need to store l.members
as a data.table
lmembers <- rbindlist(lapply(1:length(l.members),
function(i)data.table(member=names(l.members)[i],
l.members[[i]],
keep.rownames = TRUE)))
> head(lmembers)
member rn start.date due.date
1: a 1 2015-03-30 2015-04-29
2: a 2 2015-03-25 2015-12-07
3: a 3 2015-02-06 2015-03-01
4: a 4 2015-09-19 2015-11-08
5: a 5 2015-06-23 2015-08-27
6: a 6 2015-04-22 2015-10-08
Next step is the obvious use of foverlaps
.
setkey(lmembers, "member", "start.date", "due.date")
DT[, date1:=date,]
setkey(DT, "member","date", "date1")
lmemberperiods <- foverlaps(lmembers, DT)[, .(member, rn, date, dummy)]
Check that this results in a desired outcome.
lmemberperiods[member=="a" & rn==1]
l.member.periods[[1]][[1]]
Upvotes: 3