Optimization of Data.Table in R?

Question

I have two tables in R that look as the following:

DT.Purchase <- data.frame( ID = c(1,1,1,2,2,3,3,3,3,3,4,4,4,4),
                          CDS = c("0389","0389", "3298", "4545", "1282", "4545", 
                                  "0389","0389", "5685", "4545", "1282", "0389", 
                                  "1282", "1282")
                         Date = c("5/28/2016","5/26/2016","8/9/2016","2/2/2015", 
                                  "2/24/2015", "9/27/2015", "9/27/2015", "9/5/2015", 
                                  "3/3/2016", "4/9/2014", "5/1/2014", "5/4/2014",
                                  "6/9/2014", "7/7/2014"),   
                          JFK = c(T,F,F,F,T,T,F,F,T,F,T,T,T,F),
                          RFK = c(F,T,T,F,T,F,F,F,F,T,T,T,T,T), 
                          RUG = c(T,F,T,F,T,F,F,F,F,T,F,F,T,T),
                          LPG = c(T,T,T,F,F,T,T,F,F,F,F,F,T,F))


DT.Purchase$Date <- as.Date(DT.Purchase$Date, format = "%m/%d/%Y")
DT.Purchase      <- data.table(DT.Purchase)
ID  CDS     Date    JFK RFK RUG LPG
1   0389    5/28/2016   T   F   T   T
1   0389    5/26/2016   F   T   F   T
1   3298    8/9/2016    F   T   T   T
2   4545    2/2/2015    F   F   F   F
2   1282    2/24/2015   T   T   T   F
3   4545    9/27/2015   T   F   F   T
3   0389    9/27/2015   F   F   F   T
3   0389    9/5/2015    F   F   F   F
3   5685    3/3/2016    T   F   F   F
3   4545    4/9/2014    F   T   T   F
4   1282    5/1/2014    T   T   F   F
4   0389    5/4/2014    T   T   F   F
4   1282    6/9/2014    T   T   T   T
4   1282    7/7/2014    F   T   T   F

DT.Stay <- data.frame(Stay.ID = c(1,2,3,5,6,9,10,11), 
                            ID = c(1,1,2,3,3,3,4,4), 
                    Start.Date = c('5/26/2016','8/1/2016', '2/1/2015', '3/1/2016', 
                                   '9/1/2015', '4/9/2014', '4/7/2014','6/1/2014'),
                    End.Date   = c('6/6/2016','9/1/2016','3/1/2015','3/7/2016',
                                   '9/30/2015','4/14/2014','5/9/2014','7/11/2014'))
DT.Stay$Start.Date <- as.Date(DT.Stay$Start.Date, format = "%m/%d/%Y")
DT.Stay$End.Date <- as.Date(DT.Stay$End.Date, format = "%m/%d/%Y")

DT.Stay <- data.table(DT.Stay)

Stay.ID ID  Start.Date  End.Date
1   1   5/26/2016   6/6/2016
2   1   8/1/2016    9/1/2016
3   2   2/1/2015    3/1/2015
5   3   3/1/2016    3/7/2016
6   3   9/1/2015    9/30/2015
9   3   4/9/2014    4/14/2014
10  4   4/7/2014    5/9/2014
11  4   6/1/2014    7/11/2014

Now in reality, DT.Purchase is much larger (10 million observations) and DT.Stay is over 50000 observations. There are weights for DT.Purchase[ ,.(JFK, RFK, DUG, LPG)] which equals c.weights = c(1,2,1,3). These weights represent an in-house cost weight that we are considering. The goal of this is to determine purchasing deals based on previous purchases and the weights. What I would like to do is determine the previous cost.index which is the sum of all previous purchases based on the weights and previous to each End.Date for each Stay.ID. So the final data.table should look like

Stay.ID cost.index
1        10
2        16
3        4
5        11
6        10
9        3
10       6
11       10

The way I have done this involves merging according to ID (allow.cartesian = TRUE) the two datasets and checking to see if Date <= End.Date. Then I substitute the weights in and sum for each Stay.ID. It works but I am looking for a quicker way to do this. With 10 million and 50000 observations the merging becomes time and resource consuming.

eddi · Accepted Answer

With the latest devel version (1.9.7+), smth like this will work:

DT.Purchase[DT.Stay, on = .(ID = ID, Date >= Start.Date, Date <= End.Date),
            .(Stay.ID, sum(as.matrix(.SD) %*% c.weights)),
            by = .EACHI, .SDcols = JFK:LPG]

Assuming your dates are in either Date or IDate format.

In 1.9.6 you can use foverlaps instead:

foverlaps(setkey(DT.Purchase[, Date2 := Date], ID, Date, Date2),
          setkey(DT.Stay, ID, Start.Date, End.Date))[,
  sum(as.matrix(.SD) %*% c.weights), keyby = Stay.ID, .SDcols = JFK:LPG]

Optimization of Data.Table in R?

Answers (1)

Related Questions