rw2
rw2

Reputation: 1783

Left join using R/dplyr with join based on calculation

I am trying to join two tables in R using dplyr, in this way:

Table 1:

ID    CHAR1    CHAR2
01      xyz      abc
02      abc      xyz

etc...

Table 2:

ID    YEAR    VALUE1    STATUS
01    2012      455          T
01    2013       14          B
01    2014      234          C
01    2015       17          T
02    2014      213          B
02    2015      456          B
02    2016       17          B

I have a separate variable called MODEL_YEAR, which in this example is set to 2015.

I want my new table to look like this:

ID  CHAR1  CHAR2  VALUE_Tminus1 VALUE_Tminus2 STATUS_Tminus1 STATUS_Tminus2
01    xyz    abc            234            14              C              B
02    abc    xyz            213            NA              B             NA

The new columns go back in time from the MODEL_YEAR. For example, VALUE_Tminus1 takes VALUE1 from 2014. I would like the new table to go back 10 years, with NA occurring where no data is available.

So far I have tried to do this my adding MODEL_YEAR as a column in Table 1, and then doing a left join Table 1 to Table2, like this:

left_join(Table_1, Table_2, by=c("ID"="ID", "MODEL_YEAR"=("YEAR"-1))

However this doesn't work, as I can't subtract 1 from "YEAR" in this way. I think I'd possibly be able to do this by adding lots of new columns and doing multiple joins, but am not sure and wondered if there's a neater way? Perhaps using data.table - I know this can be used for joins, but I'm unfamiliar with it.

Many thanks

Upvotes: 0

Views: 1046

Answers (1)

s_baldur
s_baldur

Reputation: 33498

Since it has the tag, here is a data.table solution:

Packages

library(data.table)
library(magrittr) # For readability only

Reproducible example of your data

MODEL_YEAR <- 2015L
tab1 <- fread(
  'ID    CHAR1    CHAR2
  01      xyz      abc
  02      abc      xyz',
  colClasses = 'character'
)
tab2 <- fread(
  'ID    YEAR    VALUE1    STATUS
  01    2012      455          T
  01    2013       14          B
  01    2014      234          C
  01    2015       17          T
  02    2014      213          B
  02    2015      456          B
  02    2016       17          B',
  colClasses = c('character', 'integer', 'integer', 'character')
)

Solution:

setkey(tab1, ID)
setkey(tab2, ID, YEAR)
tab2[CJ(ID, YEAR = seq(MODEL_YEAR - 1, MODEL_YEAR - 5), unique = TRUE)] %>% 
  .[, YEAR := paste0('Tminus', MODEL_YEAR - YEAR)] %>%
  dcast(ID ~ YEAR, value.var = c('VALUE1', 'STATUS')) %>% 
  tab1[.]

#    ID CHAR1 CHAR2 VALUE1_Tminus1 VALUE1_Tminus2 VALUE1_Tminus3 VALUE1_Tminus4 VALUE1_Tminus5 STATUS_Tminus1 STATUS_Tminus2 STATUS_Tminus3 STATUS_Tminus4 STATUS_Tminus5
# 1: 01   xyz   abc            234             14            455             NA             NA              C              B              T           <NA>           <NA>
# 2: 02   abc   xyz            213             NA             NA             NA             NA              B           <NA>           <NA>           <NA>           <NA>

To get more lags just change the number 5 to anything you want.

Upvotes: 2

Related Questions