Selecting correct join with data.table

Question

I have three data tables. The real input table is much bigger and performance matters, so I want to stay within data.table as much as I can:

input <- fread("  ID   | T1 | T2 | T3 |    DATE    
                ACC001 |  1 |  0 |  0 | 31/12/2016 
                ACC001 |  1 |  0 |  1 | 30/06/2017 
                ACC002 |  0 |  1 |  1 | 31/12/2016", sep = "|")

mevs <- fread("  DATE    | INDEX_NAME | INDEX_VALUE 
              31/12/2016 | GDP        |  1.05       
              30/06/2017 | GDP        |  1.06       
              31/12/2017 | GDP        |  1.07       
              30/06/2018 | GDP        |  1.08       
              31/12/2016 | CPI        |  0.02       
              30/06/2017 | CPI        |  0.00       
              31/12/2017 | CPI        | -0.01       
              30/06/2018 | CPI        |  0.01   ", sep = "|")

time <- fread("    DATE   
               31/12/2017 
               30/06/2018 ", sep = "|")

With those, I need to achieve 2 things:

1. Insert the GDP and CPI values from the second table (mevs) into the first one (input), so that I can compute a final column based on T1, T2, T3, GDP and CPI.
2. Make a projection for the dates given in the third table (time). T1, T2 and T3 should be copied from the previous interval of the same ID if that interval exists (so the ACC001 ones would remain 1, 0, 1) and filled with 0 if it doesn't, while GDP and CPI are taken from the corresponding dates.

For that, I'm using the following pieces of code:

ones <- input[, .N, by = ID][N == 1, ID]

input[, .SD[time, on = "DATE"], by = ID
      ][dcast(mevs, DATE ~ INDEX_NAME), on = "DATE", `:=` (GDP = i.GDP, CPI = i.CPI)
        ][, (2:4) := lapply(.SD, function(x) if (.BY %in% ones) replace(x, is.na(x), 0L) else zoo::na.locf(x) )
          , by = ID, .SDcols = 2:4][]

This is meant to do the following (thanks to @Jaap):

input[, .SD[time, on = "DATE"], by = ID] joins, for each ID, the time data.table to the remaining columns, which should extend the data.table with the projection dates.
A wide version of mevs (dcast(mevs, DATE ~ INDEX_NAME)) is then joined to the extended data.table.
Finally, the missing values in the extended data.table are filled: with 0 for IDs that have only one row (ones), and with the na.locf function from the zoo package for all other IDs.
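The second step can be tried in isolation. Below is a minimal sketch on toy data, assuming the same column names as mevs; the objects long, wide and target are made up for illustration and are not part of the real data. Here value.var is passed explicitly; without it, dcast guesses the value column and prints a message.

```r
library(data.table)

# Toy long-format index table (values copied from mevs, trimmed to two dates).
long <- data.table(
  DATE        = c("31/12/2016", "30/06/2017", "31/12/2016", "30/06/2017"),
  INDEX_NAME  = c("GDP", "GDP", "CPI", "CPI"),
  INDEX_VALUE = c(1.05, 1.06, 0.02, 0.00)
)

# Long to wide: one row per DATE, one column per INDEX_NAME.
wide <- dcast(long, DATE ~ INDEX_NAME, value.var = "INDEX_VALUE")

# Hypothetical target table with one row per ID and DATE.
target <- data.table(
  ID   = c("ACC001", "ACC001", "ACC002"),
  DATE = c("31/12/2016", "30/06/2017", "31/12/2016")
)

# Update join: GDP and CPI are added to target by reference;
# the i. prefix refers to columns of the table in the i position (wide).
target[wide, on = "DATE", `:=`(GDP = i.GDP, CPI = i.CPI)]
print(target)
```

An update join of this kind never adds or drops rows in target, so it only fills GDP and CPI for dates that already exist there.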

The intended output would be:

       ID T1 T2 T3       DATE  GDP   CPI
1: ACC001  1  0  0 31/12/2016 1.05  0.02
2: ACC001  1  0  1 30/06/2017 1.06  0.00
3: ACC001  1  0  1 31/12/2017 1.07 -0.01
4: ACC001  1  0  1 30/06/2018 1.08  0.01
5: ACC002  0  1  1 31/12/2016 1.05  0.02
6: ACC002  0  0  0 30/06/2017 1.06  0.00
7: ACC002  0  0  0 31/12/2017 1.07 -0.01
8: ACC002  0  0  0 30/06/2018 1.08  0.01

But instead what I get is:

       ID T1 T2 T3       DATE  GDP   CPI
1: ACC001 NA NA NA 31/12/2017 1.07 -0.01
2: ACC001 NA NA NA 30/06/2018 1.08  0.01
3: ACC002 NA NA NA 31/12/2017 1.07 -0.01
4: ACC002 NA NA NA 30/06/2018 1.08  0.01

I'm almost sure the cause is a wrong join choice between input and time in the first step, but I can't find a workaround for it.

Thanks everyone for your time.

Jaap · Accepted Answer

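A possible solution is to join on the union of the dates in time and input rather than on time alone. X[i, on = ...] returns one row for every row of i, so joining on time alone keeps only the two projection dates, for which input has no T1, T2 or T3 values. Converting DATE to class Date first also lets the dates be ordered chronologically: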

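# All dates needed per ID: the projection dates plus the dates already in input.
times <- unique(rbindlist(list(time, as.data.table(unique(input$DATE))))
                )[, DATE := as.Date(DATE, "%d/%m/%Y")][order(DATE)]
# Convert DATE in the other tables to class Date as well, so the joins match.
input[, DATE := as.Date(DATE, "%d/%m/%Y")]
mevs[, DATE := as.Date(DATE, "%d/%m/%Y")]

# IDs with a single row in input: their missing T-values are filled with 0.
ones <- input[, .N, by = ID][N == 1, ID]

# 1) expand every ID to all dates, 2) add GDP and CPI from the wide mevs,
# 3) fill missing T1, T2, T3 per ID (0 for 'ones', last observation otherwise).
input[, .SD[times, on = "DATE"], by = ID
      ][dcast(mevs, DATE ~ INDEX_NAME, value.var = "INDEX_VALUE"), on = "DATE", `:=` (GDP = i.GDP, CPI = i.CPI)
        ][, (2:4) := lapply(.SD, function(x) if (.BY %in% ones) replace(x, is.na(x), 0L) else zoo::na.locf(x) )
          , by = ID, .SDcols = 2:4][]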

which gives:

       ID T1 T2 T3       DATE  GDP   CPI
1: ACC001  1  0  0 2016-12-31 1.05  0.02
2: ACC001  1  0  1 2017-06-30 1.06  0.00
3: ACC001  1  0  1 2017-12-31 1.07 -0.01
4: ACC001  1  0  1 2018-06-30 1.08  0.01
5: ACC002  0  1  1 2016-12-31 1.05  0.02
6: ACC002  0  0  0 2017-06-30 1.06  0.00
7: ACC002  0  0  0 2017-12-31 1.07 -0.01
8: ACC002  0  0  0 2018-06-30 1.08  0.01
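To see why the set of dates used in the join matters, here is a minimal sketch on toy data. The objects x, i, only_i, all_dates and both are made up for illustration; only the join syntax is the same as in the code above.

```r
library(data.table)

# Toy data: x holds the history, i holds two later dates that x does not contain.
x <- data.table(DATE = as.Date(c("2016-12-31", "2017-06-30")), T1 = c(1L, 1L))
i <- data.table(DATE = as.Date(c("2017-12-31", "2018-06-30")))

# x[i, on = ...] returns one row per row of i; the columns of x are NA
# for dates without a match, and the original rows of x are not returned.
only_i <- x[i, on = "DATE"]

# Joining on the union of both date sets keeps the history
# and adds the new dates as rows with NA, ready to be filled.
all_dates <- unique(rbind(x[, .(DATE)], i))[order(DATE)]
both <- x[all_dates, on = "DATE"]

print(only_i)
print(both)
```

In this sketch only_i has two rows with T1 missing, which mirrors the output in the question, while both has four rows and keeps the two known T1 values.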
