Find matches to several tables: conditional (full) join using data.table

Question

This might have an easy solution, but I can't seem to crack it.

For an example, suppose I have a table listing purchases and customer details:

library(data.table)
purchase <- setDT(structure(list(Name = c("John", "John", "Mary"), Surname = c("Smith", 
"Smith", "Jane"), PurchaseDate = c("2017-01-01", "2015-01-01", 
"2017-01-02")), .Names = c("Name", "Surname", "PurchaseDate"), row.names = c(NA, 
-3L), class = c("data.table", "data.frame")))

> purchase
   Name Surname PurchaseDate
1: John   Smith   2017-01-01
2: John   Smith   2015-01-01
3: Mary    Jane   2017-01-02

And I want to find if these customers held a valid discount card at a time of purchase, which matches data held in two databases:

df1 <- setDT(structure(list(Name = "John", Surname = "Smith", ValidFrom = "2016-12-31", 
    ValidTo = "2017-01-02"), .Names = c("Name", "Surname", "ValidFrom", 
"ValidTo"), row.names = c(NA, -1L), class = c("data.table", "data.frame")))

df2 <- setDT(structure(list(Name = "Mary", Surname = "Jane", ValidFrom = "2017-01-01", 
    ValidTo = "2017-01-03"), .Names = c("Name", "Surname", "ValidFrom", 
"ValidTo"), row.names = c(NA, -1L), class = c("data.table", "data.frame")))

> df1
   Name Surname  ValidFrom    ValidTo
1: John   Smith 2016-12-31 2017-01-02
> df2
   Name Surname  ValidFrom    ValidTo
1: Mary    Jane 2017-01-01 2017-01-03

I am adapting this solution, which uses data.table

library(data.table)
purchase[df1, on=c(Name='Name', Surname='Surname'), Match := 'Yes']
purchase[df2, on=c(Name='Name', Surname='Surname'), Match := 'Yes']

The results of this (based on left join) are saved to the Match variable in the original purchase table. (Importantly, this does not require creating a new object, but saves the result to the original object, otherwise it would become messy.)

> purchase
   Name Surname PurchaseDate Match
1: John   Smith   2017-01-01   Yes
2: John   Smith   2015-01-01   Yes
3: Mary    Jane   2017-01-02   Yes

But, I also need to check that the PurchaseDate is within ValidFrom and ValidTo dates, and don't have a good idea of how to do this.

To do this, I can bring in the ValidFrom and ValidTo dates to the join and later determine whether the purchase was between those dates using ifelse.

purchase[df1, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]
purchase[df2, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]

Great! This brings the dates:

   Name Surname PurchaseDate Match      VFrom        VTo
1: John   Smith   2017-01-01   Yes 2016-12-31 2017-01-02
2: John   Smith   2015-01-01   Yes 2016-12-31 2017-01-02
3: Mary    Jane   2017-01-02   Yes 2017-01-01 2017-01-03

But, the problem arises if customer had two discount cards, and a purchase falls in the period of validity of only one of them. Suppose Mary had two cards:

df2 <- setDT(structure(list(Name = structure(c(1L, 1L), .Label = "Mary", class = "factor"), 
    Surname = structure(c(1L, 1L), .Label = "Jane", class = "factor"), 
    ValidFrom = structure(1:2, .Label = c("2017-01-01", "1945-01-01"
    ), class = "factor"), ValidTo = structure(1:2, .Label = c("2017-01-03", 
    "1946-01-01"), class = "factor")), .Names = c("Name", "Surname", 
"ValidFrom", "ValidTo"), row.names = c(NA, -2L), class = c("data.table", "data.frame")))

> df2
   Name Surname  ValidFrom    ValidTo
1: Mary    Jane 2017-01-01 2017-01-03
2: Mary    Jane 1945-01-01 1946-01-01

Running this

purchase[df2, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]

Brings only one of these pairs of dates (apparently the earlies one, regardless of the row number).

   Name Surname PurchaseDate Match      VFrom        VTo
1: John   Smith   2017-01-01   Yes 2016-12-31 2017-01-02
2: John   Smith   2015-01-01   Yes 2016-12-31 2017-01-02
3: Mary    Jane   2017-01-02   Yes 1945-01-01 1946-01-01

How would I bring in all matched rows?

From what I learned, X[Y] syntax supports appending to the original object (which I need), and also the := function, which I need, but does not support a full join. An alternative merge supports full joins, but requires creation of new object at every join-step (will be extremely messy), and does not support :=. Any ideas? Is there a way to use foverlaps somehow?

Frank · Accepted Answer

Here's one way of approaching it:

# clean data
purchase[, PurchaseDate := as.IDate(PurchaseDate)]
df1[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]
df2[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]

# initialize
purchase[, matched := FALSE ]

# update joins
purchase[!(matched), matched := 
  df1[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate), 
    .N, by=.EACHI ]$N > 0L
]
purchase[!(matched), matched := 
  df2[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate), 
    .N, by=.EACHI ]$N > 0L
]

I'm keeping df1 and df2 separate since the OP mentioned that their join rules differ in their actual use-case.

How it works

The overall structure is...

DT[, matched := FALSE ]
DT[!(matched), matched := expr1 ]
DT[!(matched), matched := expr2 ]

so we're initializing matched to false; and at each later step, updating unmatched rows, !(matched).

The expressions start with DT2[.SD, ...], which is just a join against the Subset of Data we have after filtering with !(matched). Joins like this look up rows of .SD in DT2 according to the on= filters. In this case, the on= filters are associated with a non equi join.***

When we use by=.EACHI we are grouping by each row of .SD. With .N, by=.EACHI, we get the number of DT2 rows matched for each row of .SD.

Once we have the number of matched rows, we can compare N > 0L to update matched.

*** Unfortunately, as of April 2017 there's an open bug in this pattern of usage that sometimes gives an error about .SD. The workaround is to replace .SD with copy(.SD).

Find matches to several tables: conditional (full) join using data.table

Answers (1)

Related Questions