richSC
richSC

Reputation: 45

Calculate difference in date between rows

I have a data.table where I would like to flag entries that are within 90 days of a prior entry for a given group id. Context is that these are buy signals for trades. So I don't want duplicates within a 90 day window, since I am assuming I hold the position for 90 days, and thus would already have bought a position (and I don't want to restart the clock).

So I have:

library(data.table)
> dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
> dt
   id       date
1:  A 2017-01-01
2:  A 2017-02-01
3:  A 2017-05-01
4:  B 2017-01-01
5:  B 2017-05-01
6:  B 2017-10-01
7:  C 2017-01-01
8:  C 2017-02-01
9:  C 2017-02-15

And I want to get:

> dt2
   id       date with.90d
1:  A 2017-01-01    FALSE
2:  A 2017-02-01     TRUE
3:  A 2017-05-01    FALSE
4:  B 2017-01-01    FALSE
5:  B 2017-05-01    FALSE
6:  B 2017-10-01    FALSE
7:  C 2017-01-01    FALSE
8:  C 2017-02-01     TRUE
9:  C 2017-02-15     TRUE

I feel like I should be able to do this with .SD, but I can't quite figure it out. Thanks for any help!

Upvotes: 2

Views: 2201

Answers (4)

Uwe
Uwe

Reputation: 42544

The OP has requested:

From first observation X in each group, I want to flag any other observations less than 90 days away from X. Then for the next day observation that is greater than 90 days away from X, call it observation Y, I want to flag any observations within 90 days of Y. Repeat.

If I understood correctly from the expected result, a value of FALSE in the with.90d column indicates the beginn of 90 day period.

Unfortunately, the beginn of the next 90 day period depends on the date of the next observation after expiration of the previous 90 day period. Therefore, we cannot use fixed 90 day intervals starting with the first date in each group.

I have tried to find solutions using non-equi joins or rolling joins but I ended up so far with a recursive approach:

dt3[, with.90d := NA]
while (dt3[, any(is.na(with.90d))]) 
  dt3[is.na(with.90d), cd := date - min(date), by = id][
    is.na(with.90d) & cd == 0, with.90d := FALSE][
      is.na(with.90d) & cd <= 90, with.90d := TRUE]
dt3
    id       date with.90d      cd
 1:  A 2017-01-01    FALSE  0 days
 2:  A 2017-02-01     TRUE 31 days
 3:  A 2017-05-01    FALSE  0 days
 4:  B 2017-01-01    FALSE  0 days
 5:  B 2017-05-01    FALSE  0 days
 6:  B 2017-10-01    FALSE  0 days
 7:  C 2017-01-01    FALSE  0 days
 8:  C 2017-02-01     TRUE 31 days
 9:  C 2017-02-15     TRUE 45 days
10:  D 2017-03-01    FALSE  0 days
11:  D 2017-04-01     TRUE 31 days
12:  D 2017-05-01     TRUE 61 days
13:  D 2017-06-01    FALSE  0 days
14:  D 2017-07-01     TRUE 30 days
15:  D 2017-08-01     TRUE 61 days
16:  E 2017-01-01    FALSE  0 days
17:  E 2017-02-01     TRUE 31 days
18:  E 2017-03-01     TRUE 59 days
19:  E 2017-04-01     TRUE 90 days
20:  E 2017-05-01    FALSE  0 days
21:  E 2017-06-01     TRUE 31 days
    id       date with.90d      cd

Note that I have appended two more groups, D and E to OP's sample dataset in order to better verify the approach. Also note how the result differs for group D which starts on 2017-03-01 and E which starts on 2017-01-01.

Explanation

As long as there are NA values in with.90d, the following sequence is repeated for the NA rows only (rows with TRUE or FALSE values are completed):

  • Compute the day differences to the first date in each group. Note that min(date) is used which works also with an unordered dataset. Alternatively, setorder(dt3, date) and first(date) (or date[1]) can be used.
  • Rows with a day difference of 0 indicate the start of a new period and are flagged as FALSE.
  • Rows with a day difference of less or equal 90 days are flagged as TRUE.
  • All other rows remain untouched, i.e., they keep the NA value.

For illustration I have kept the helper column cd. It can be removed by dt3[, cd := NULL].

Data

# OP's sample dataset
dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), 
                 date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
# append group D
dt2 <- dt[, .(id = c(id, rep("D", 6)), 
              date = c(date, seq(as.Date("2017-03-01"), length.out = 6, by = "1 month")))]
# append group E
dt3 <- dt2[, .(id = c(id, rep("E", 6)), 
               date = c(date, seq(as.Date("2017-01-01"), length.out = 6, by = "1 month")))]

Upvotes: 1

Onyambu
Onyambu

Reputation: 79198

You can Also use base r functions:

transform(dt,with.90days=unlist(by(dt$date,dt$id,function(x)c(F,cumsum(as.numeric(diff(x)))<90))))
   id       date with.90days
1:  A 2017-01-01       FALSE
2:  A 2017-02-01        TRUE
3:  A 2017-05-01       FALSE
4:  B 2017-01-01       FALSE
5:  B 2017-05-01       FALSE
6:  B 2017-10-01       FALSE
7:  C 2017-01-01       FALSE
8:  C 2017-02-01        TRUE
9:  C 2017-02-15        TRUE

Upvotes: 1

Mike H.
Mike H.

Reputation: 14360

It sounds like you want to compare to all prior trades to make sure that the current trade is not within 90 days of ANY of them. To do this you could try:

dt[order(id, date), with.90d := sapply(1:(.N), function(i) all(difftime(date[i], date[1:(i-1)], units = "days") < 90) & i != 1L), by = id]

dt
#   id       date with.90d
#1:  A 2017-01-01    FALSE
#2:  A 2017-02-01     TRUE
#3:  A 2017-05-01    FALSE
#4:  B 2017-01-01    FALSE
#5:  B 2017-05-01    FALSE
#6:  B 2017-10-01    FALSE
#7:  C 2017-01-01    FALSE
#8:  C 2017-02-01     TRUE
#9:  C 2017-02-15     TRUE

What this is doing is taking the difference of the current date with all prior dates (within that group) and checking that all these differences are < 90 days. If any is >= 90 it will flag it with FALSE. Note that I use all() to return a logical, but you could use min() which would likely be faster.

Upvotes: 1

pogibas
pogibas

Reputation: 28319

You can use difftime:

# Data
library(data.table)
dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))

# Difference in days    
dt[, with.90d := as.numeric(difftime(date, shift(date), units = "days")) < 90, id]
dt[is.na(with.90d), with.90d := FALSE]

#    id       date with.90d
# 1:  A 2017-01-01    FALSE
# 2:  A 2017-02-01     TRUE
# 3:  A 2017-05-01     TRUE
# 4:  B 2017-01-01    FALSE
# 5:  B 2017-05-01    FALSE
# 6:  B 2017-10-01    FALSE
# 7:  C 2017-01-01    FALSE
# 8:  C 2017-02-01     TRUE
# 9:  C 2017-02-15     TRUE

Explanation:

  • Calculate difference in time using difftime(). Difference is between date and shifted date calculated by group (id).
  • Check if difference is less than 90 days.
  • Insert FALSE for first date per group (is.na())

Upvotes: 2

Related Questions