Mihail
Mihail

Reputation: 827

Find matches to several tables: conditional (full) join using data.table

This might have an easy solution, but I can't seem to crack it.

For an example, suppose I have a table listing purchases and customer details:

library(data.table)
purchase <- setDT(structure(list(Name = c("John", "John", "Mary"), Surname = c("Smith", 
"Smith", "Jane"), PurchaseDate = c("2017-01-01", "2015-01-01", 
"2017-01-02")), .Names = c("Name", "Surname", "PurchaseDate"), row.names = c(NA, 
-3L), class = c("data.table", "data.frame")))

> purchase
   Name Surname PurchaseDate
1: John   Smith   2017-01-01
2: John   Smith   2015-01-01
3: Mary    Jane   2017-01-02

And I want to find if these customers held a valid discount card at a time of purchase, which matches data held in two databases:

df1 <- setDT(structure(list(Name = "John", Surname = "Smith", ValidFrom = "2016-12-31", 
    ValidTo = "2017-01-02"), .Names = c("Name", "Surname", "ValidFrom", 
"ValidTo"), row.names = c(NA, -1L), class = c("data.table", "data.frame")))

df2 <- setDT(structure(list(Name = "Mary", Surname = "Jane", ValidFrom = "2017-01-01", 
    ValidTo = "2017-01-03"), .Names = c("Name", "Surname", "ValidFrom", 
"ValidTo"), row.names = c(NA, -1L), class = c("data.table", "data.frame")))

> df1
   Name Surname  ValidFrom    ValidTo
1: John   Smith 2016-12-31 2017-01-02
> df2
   Name Surname  ValidFrom    ValidTo
1: Mary    Jane 2017-01-01 2017-01-03

I am adapting this solution, which uses data.table

library(data.table)
purchase[df1, on=c(Name='Name', Surname='Surname'), Match := 'Yes']
purchase[df2, on=c(Name='Name', Surname='Surname'), Match := 'Yes']

The results of this (based on left join) are saved to the Match variable in the original purchase table. (Importantly, this does not require creating a new object, but saves the result to the original object, otherwise it would become messy.)

> purchase
   Name Surname PurchaseDate Match
1: John   Smith   2017-01-01   Yes
2: John   Smith   2015-01-01   Yes
3: Mary    Jane   2017-01-02   Yes

But, I also need to check that the PurchaseDate is within ValidFrom and ValidTo dates, and don't have a good idea of how to do this.

To do this, I can bring in the ValidFrom and ValidTo dates to the join and later determine whether the purchase was between those dates using ifelse.

purchase[df1, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]
purchase[df2, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]

Great! This brings the dates:

   Name Surname PurchaseDate Match      VFrom        VTo
1: John   Smith   2017-01-01   Yes 2016-12-31 2017-01-02
2: John   Smith   2015-01-01   Yes 2016-12-31 2017-01-02
3: Mary    Jane   2017-01-02   Yes 2017-01-01 2017-01-03

But, the problem arises if customer had two discount cards, and a purchase falls in the period of validity of only one of them. Suppose Mary had two cards:

df2 <- setDT(structure(list(Name = structure(c(1L, 1L), .Label = "Mary", class = "factor"), 
    Surname = structure(c(1L, 1L), .Label = "Jane", class = "factor"), 
    ValidFrom = structure(1:2, .Label = c("2017-01-01", "1945-01-01"
    ), class = "factor"), ValidTo = structure(1:2, .Label = c("2017-01-03", 
    "1946-01-01"), class = "factor")), .Names = c("Name", "Surname", 
"ValidFrom", "ValidTo"), row.names = c(NA, -2L), class = c("data.table", "data.frame")))

> df2
   Name Surname  ValidFrom    ValidTo
1: Mary    Jane 2017-01-01 2017-01-03
2: Mary    Jane 1945-01-01 1946-01-01

Running this

purchase[df2, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]

Brings only one of these pairs of dates (apparently the earlies one, regardless of the row number).

   Name Surname PurchaseDate Match      VFrom        VTo
1: John   Smith   2017-01-01   Yes 2016-12-31 2017-01-02
2: John   Smith   2015-01-01   Yes 2016-12-31 2017-01-02
3: Mary    Jane   2017-01-02   Yes 1945-01-01 1946-01-01

How would I bring in all matched rows?

From what I learned, X[Y] syntax supports appending to the original object (which I need), and also the := function, which I need, but does not support a full join. An alternative merge supports full joins, but requires creation of new object at every join-step (will be extremely messy), and does not support :=. Any ideas? Is there a way to use foverlaps somehow?

Upvotes: 1

Views: 114

Answers (1)

Frank
Frank

Reputation: 66819

Here's one way of approaching it:

# clean data
purchase[, PurchaseDate := as.IDate(PurchaseDate)]
df1[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]
df2[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]

# initialize
purchase[, matched := FALSE ]

# update joins
purchase[!(matched), matched := 
  df1[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate), 
    .N, by=.EACHI ]$N > 0L
]
purchase[!(matched), matched := 
  df2[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate), 
    .N, by=.EACHI ]$N > 0L
]

I'm keeping df1 and df2 separate since the OP mentioned that their join rules differ in their actual use-case.


How it works

The overall structure is...

DT[, matched := FALSE ]
DT[!(matched), matched := expr1 ]
DT[!(matched), matched := expr2 ]

so we're initializing matched to false; and at each later step, updating unmatched rows, !(matched).

The expressions start with DT2[.SD, ...], which is just a join against the Subset of Data we have after filtering with !(matched). Joins like this look up rows of .SD in DT2 according to the on= filters. In this case, the on= filters are associated with a non equi join.***

When we use by=.EACHI we are grouping by each row of .SD. With .N, by=.EACHI, we get the number of DT2 rows matched for each row of .SD.

Once we have the number of matched rows, we can compare N > 0L to update matched.


*** Unfortunately, as of April 2017 there's an open bug in this pattern of usage that sometimes gives an error about .SD. The workaround is to replace .SD with copy(.SD).

Upvotes: 2

Related Questions