Conditional join in r

Question

I would like to conditionally join two data tables together:

library(data.table)
set.seed(1)

key.table <- 
  data.table(
    out = (0:10)/10,
    keyz = sort(runif(11))
  )

large.tbl <- 
  data.table(
    ab = rnorm(1e6),
    cd = runif(1e6)
  )

according to the following rule: match the smallest value of out in key.table whose keyz value is larger than cd. I have the following:

library(dplyr)
large.tbl %>%
  rowwise %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))

which provides the correct output. The problem I have is that the rowwise operation seems expensive for the large.tbl I am actually using, crashing it unless it is on a particular computer. Are there less memory-expensive operations? The following seems slightly faster, but not enough for the problem I have.

large.tbl %>%
    group_by(cd) %>%
    mutate(out = min(key.table$out[key.table$keyz > cd]))

This smells like a problem with a data.table answer, but the answer does not have to use that package.

MichaelChirico · Accepted Answer

What you want is:

setkey(large.tbl, cd)
setkey(key.table, keyz)
key.table[large.tbl, roll = -Inf]

See ?data.table>roll:

Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join. When roll is a positive number, this limits how far values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is "nearest", the nearest value is joined to.

(to be fair I think this could go for some elucidation, it's pretty dense)

Conditional join in r

Answers (2)

Related Questions