Francis

Reputation: 6726

R - Fast subsetting a large data table conditioned on a list of data frames

I have a data table and a list of data frames, built as follows:

require(data.table)
members = c('a', 'b', 'c')

# one row per member per day of 2015, with a random dummy value
DT = do.call('rbind',
         lapply(members, function(x){
             date = seq(as.Date("2015/1/1"), as.Date("2015/12/31"), 'days')
             dummy = sample(length(date))
             data.table(member=sample(x, length(dummy), replace=TRUE), date=date, dummy=dummy)
         })
     )

# for each member, draw 1-10 random [start.date, due.date] periods within 2015
date = seq(as.Date("2015/1/1"), as.Date("2015/12/31"), 'days')
l.members = lapply(members, function(x){
                n.period = sample(10, 1)
                do.call('rbind',
                    lapply(1:n.period, function(y){
                        period = sort(sample(date, 2))
                        data.frame(start.date=period[1], due.date=period[2])
                    })
                )
            })
names(l.members) = members

DT is the large data table (about 4 GB as a CSV file) from which I want to subset rows based on l.members. The name of each entry x in l.members is one of the members in unique(DT$member). Each entry is a data frame in which every row represents a period [p1, p2], and for each such period I want the rows of DT where DT$member is x and DT$date >= p1 and DT$date <= p2. My current workaround is as follows:

# for each member, subset that member's rows of DT once per period
l.member.periods = lapply(members,
                       function(x){
                           DT.member = DT[member == x]
                           apply(l.members[[x]], 1,
                               function(y){
                                   start = y[1]
                                   due = y[2]
                                   DT.member[date >= start & date <= due]
                               }
                           )
                       }
                   )
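For a single member and a single period, each inner call above is just doing a plain subset like this (using the first period of member 'a' purely for illustration):

# what one inner call of the workaround computes, for member 'a' and its first period
p = l.members[['a']][1, ]
DT[member == 'a' & date >= p$start.date & date <= p$due.date]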

It takes ages when there are about 5000 entries in l.members, each of which has at most about 10 rows (periods). I've tried replacing lapply with mclapply, but that doesn't seem to help: it ends up exhausting the memory and hanging. How can I speed up the process?

Upvotes: 1

Views: 456

Answers (1)

Khashaa

Reputation: 7373

You can use foverlaps(). First, you need to store l.members as a single data.table:

# stack the per-member data frames into one data.table, tagging each row with its member
lmembers <- rbindlist(lapply(seq_along(l.members),
                             function(i) data.table(member = names(l.members)[i],
                                                    l.members[[i]],
                                                    keep.rownames = TRUE)))

> head(lmembers)
   member rn start.date   due.date
1:      a  1 2015-03-30 2015-04-29
2:      a  2 2015-03-25 2015-12-07
3:      a  3 2015-02-06 2015-03-01
4:      a  4 2015-09-19 2015-11-08
5:      a  5 2015-06-23 2015-08-27
6:      a  6 2015-04-22 2015-10-08
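The rn column comes from keep.rownames = TRUE and simply numbers the periods within each member, which makes it easy to refer back to a single period later.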

The next step is the obvious use of foverlaps. Since foverlaps matches intervals on both sides, date is copied into an auxiliary date1 column so that each row of DT becomes a degenerate interval [date, date1]:

setkey(lmembers, member, start.date, due.date)
DT[, date1 := date]
setkey(DT, member, date, date1)
lmemberperiods <- foverlaps(lmembers, DT)[, .(member, rn, date, dummy)]

Check that this gives the desired outcome, for example for the first period of member "a":

lmemberperiods[member=="a" & rn==1]
l.member.periods[[1]][[1]]
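If you want to compare more than one period at a time, a minimal sketch along these lines (purely illustrative: it just reindexes the flat result by member and period number rn) rebuilds the nested structure of l.member.periods:

# sketch: split the flat foverlaps result back into a nested
# member -> period list, mirroring l.member.periods
l.check <- lapply(members, function(x) {
    lapply(seq_len(nrow(l.members[[x]])), function(k) {
        lmemberperiods[member == x & rn == as.character(k), .(member, date, dummy)]
    })
})
names(l.check) <- members

Each l.check[[x]][[k]] should then contain the same rows as the corresponding entry of l.member.periods (up to row order).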

Upvotes: 3
