Alex
Alex

Reputation: 15708

How do you change the data table rolling join condition from weak inequality to strict inequality?

Consider the two following two datasets, with a 'time' column which represents a general timestamp, integers are used for simplicity of illustrating the example:

library(data.table)

df_test_1 <- 
    data.table(time = c(1:10, seq(20, 30, by = 5))) %>%
df_test_1$values <- -df_test_1$time
df_test_1 <- setkey(df_test_1, time)

df_test_2 <-
    data.table(time = c(15, 20, 26, 28, 31))
df_test_2 <- setkey(df_test_2, time)

so that:

> df_test_1
    time values
 ...
 5:    5     -5
 6:    6     -6
 7:    7     -7
 8:    8     -8
 9:    9     -9
10:   10    -10
11:   20    -20
12:   25    -25
13:   30    -30

and:

> df_test_2

   time
1:   15
2:   20
3:   26
4:   28
5:   31

The rolling join df_test_1[df_test_2, roll = -Inf] produces:

> df_test_1[df_test_2, roll = -Inf]
   time values
1:   15    -20
2:   20    -20
3:   26    -30
4:   28    -30
5:   31     NA

That is, for each time value in df_test_1, find all time values in df_test_2 smaller than or equal to it, and associate the corresponding value to this row of df_test_2. For example, df_test_1$time == 20 matches the time values 15 and 20 in df_test_2$time, thus the corresponding value of -20 is associated to these rows of df_test_2.

I would like to change the join condition (in bold above) to smaller than it, that is, the answer produced should be:

   time values
1:   15    -20
2:   20    -25
3:   26    -30
4:   28    -30
5:   31     NA

The difference here is that the value at df_test_1$time == 25 should be matched to the row of df_test_2 where df_test_2$time == 20.

An alternate way of producing the desired result would be to take away a minuscule portion from time:

df_test_3 <-
    df_test_1 %>%
    mutate(time = time - 0.1) %>%
    setkey(time)

so that:

> df_test_3[df_test_2, roll = -Inf]
   time values
1:   15    -20
2:   20    -25
3:   26    -30
4:   28    -30
5:   31     NA

Upvotes: 3

Views: 395

Answers (1)

Arun
Arun

Reputation: 118799

Using the new non-equi joins feature in the current development version of data.table, this is straightforward:

# v1.9.7+
df_test_1[df_test_2, on=.(time > time), mult="first"]

Keyed joins are capable of only equi joins. on argument is essential for conditional joins.

Note that there's no need for data.tables to be keyed if the on argument is used. Even if you wish to key the data.tables, specifying on is better as it helps understand the code immediately at a later point.

See the installation instructions for devel version here.

Upvotes: 2

Related Questions