Reputation: 45
I have a data.table where I would like to flag entries that are within 90 days of a prior entry for a given group id. The context is that these are buy signals for trades, so I don't want duplicates within a 90-day window: I am assuming I hold the position for 90 days and thus would already have bought a position (and I don't want to restart the clock).
So I have:
library(data.table)
> dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
> dt
id date
1: A 2017-01-01
2: A 2017-02-01
3: A 2017-05-01
4: B 2017-01-01
5: B 2017-05-01
6: B 2017-10-01
7: C 2017-01-01
8: C 2017-02-01
9: C 2017-02-15
And I want to get:
> dt2
id date with.90d
1: A 2017-01-01 FALSE
2: A 2017-02-01 TRUE
3: A 2017-05-01 FALSE
4: B 2017-01-01 FALSE
5: B 2017-05-01 FALSE
6: B 2017-10-01 FALSE
7: C 2017-01-01 FALSE
8: C 2017-02-01 TRUE
9: C 2017-02-15 TRUE
I feel like I should be able to do this with .SD, but I can't quite figure it out. Thanks for any help!
Upvotes: 2
Views: 2201
Reputation: 42544
The OP has requested:
From first observation X in each group, I want to flag any other observations less than 90 days away from X. Then for the next day observation that is greater than 90 days away from X, call it observation Y, I want to flag any observations within 90 days of Y. Repeat.
If I understand the expected result correctly, a value of FALSE in the with.90d column indicates the beginning of a 90-day period.
Unfortunately, the beginning of the next 90-day period depends on the date of the next observation after the previous 90-day period has expired. Therefore, we cannot use fixed 90-day intervals starting with the first date in each group.
I have tried to find solutions using non-equi joins or rolling joins, but so far I have ended up with a recursive approach:
dt3[, with.90d := NA]
while (dt3[, any(is.na(with.90d))]) {
  # for the rows not yet flagged: days since the earliest remaining date per group
  dt3[is.na(with.90d), cd := date - min(date), by = id][
    is.na(with.90d) & cd == 0, with.90d := FALSE][   # start of a new 90-day period
      is.na(with.90d) & cd <= 90, with.90d := TRUE]  # within the current 90-day period
}
dt3
    id       date with.90d      cd
 1:  A 2017-01-01    FALSE  0 days
 2:  A 2017-02-01     TRUE 31 days
 3:  A 2017-05-01    FALSE  0 days
 4:  B 2017-01-01    FALSE  0 days
 5:  B 2017-05-01    FALSE  0 days
 6:  B 2017-10-01    FALSE  0 days
 7:  C 2017-01-01    FALSE  0 days
 8:  C 2017-02-01     TRUE 31 days
 9:  C 2017-02-15     TRUE 45 days
10:  D 2017-03-01    FALSE  0 days
11:  D 2017-04-01     TRUE 31 days
12:  D 2017-05-01     TRUE 61 days
13:  D 2017-06-01    FALSE  0 days
14:  D 2017-07-01     TRUE 30 days
15:  D 2017-08-01     TRUE 61 days
16:  E 2017-01-01    FALSE  0 days
17:  E 2017-02-01     TRUE 31 days
18:  E 2017-03-01     TRUE 59 days
19:  E 2017-04-01     TRUE 90 days
20:  E 2017-05-01    FALSE  0 days
21:  E 2017-06-01     TRUE 31 days
    id       date with.90d      cd
Note that I have appended two more groups, D and E, to the OP's sample dataset in order to better verify the approach. Also note how the result differs for group D, which starts on 2017-03-01, and group E, which starts on 2017-01-01.
As long as there are NA values in with.90d, the following sequence is repeated for the NA rows only (rows that already hold TRUE or FALSE values are completed):

1. The distance in days cd between each date and the earliest of the still unflagged dates is computed per group. min(date) is used, which also works with an unordered dataset. Alternatively, setorder(dt3, date) and first(date) (or date[1]) can be used (a sketch of this variant follows the data preparation code below).
2. Rows where cd is 0 indicate the start of a new 90-day period and are flagged as FALSE.
3. Rows within the 90-day period (cd <= 90) are flagged as TRUE.
4. All remaining rows keep their NA value.

For illustration I have kept the helper column cd. It can be removed by dt3[, cd := NULL].
# OP's sample dataset
dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01",
                                  "2017-01-01", "2017-05-01", "2017-10-01",
                                  "2017-01-01", "2017-02-01", "2017-02-15")))
# append group D
dt2 <- dt[, .(id = c(id, rep("D", 6)),
              date = c(date, seq(as.Date("2017-03-01"), length.out = 6, by = "1 month")))]
# append group E
dt3 <- dt2[, .(id = c(id, rep("E", 6)),
               date = c(date, seq(as.Date("2017-01-01"), length.out = 6, by = "1 month")))]
Upvotes: 1
Reputation: 79198
You can also use base R functions:
transform(dt, with.90days = unlist(by(dt$date, dt$id, function(x) c(F, cumsum(as.numeric(diff(x))) < 90))))
id date with.90days
1: A 2017-01-01 FALSE
2: A 2017-02-01 TRUE
3: A 2017-05-01 FALSE
4: B 2017-01-01 FALSE
5: B 2017-05-01 FALSE
6: B 2017-10-01 FALSE
7: C 2017-01-01 FALSE
8: C 2017-02-01 TRUE
9: C 2017-02-15 TRUE
Upvotes: 1
Reputation: 14360
It sounds like you want to compare to all prior trades to make sure that the current trade is not within 90 days of ANY of them. To do this you could try:
dt[order(id, date),
   with.90d := sapply(1:(.N), function(i)
     all(difftime(date[i], date[1:(i-1)], units = "days") < 90) & i != 1L),
   by = id]
dt
# id date with.90d
#1: A 2017-01-01 FALSE
#2: A 2017-02-01 TRUE
#3: A 2017-05-01 FALSE
#4: B 2017-01-01 FALSE
#5: B 2017-05-01 FALSE
#6: B 2017-10-01 FALSE
#7: C 2017-01-01 FALSE
#8: C 2017-02-01 TRUE
#9: C 2017-02-15 TRUE
What this is doing is taking the difference of the current date with all prior dates (within that group) and checking that all these differences are < 90 days. If any is >= 90, the row is flagged with FALSE. Note that I use all() to return a logical, but you could use min() which would likely be faster.
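For example, a minimal sketch of the min() idea, assuming it means taking min() over the prior dates (so each date is only compared against the earliest prior date, which is equivalent to checking all of the differences):

# sketch: compare each date only to the earliest prior date in the group
# (equivalent to the all() version above)
dt[order(id, date),
   with.90d := sapply(1:(.N), function(i)
     i != 1L && difftime(date[i], min(date[1:(i - 1)]), units = "days") < 90),
   by = id]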
Upvotes: 1
Reputation: 28319
You can use difftime():
# Data
library(data.table)
dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01",
                                  "2017-01-01", "2017-05-01", "2017-10-01",
                                  "2017-01-01", "2017-02-01", "2017-02-15")))
# Difference in days
dt[, with.90d := as.numeric(difftime(date, shift(date), units = "days")) < 90, id]
dt[is.na(with.90d), with.90d := FALSE]
# id date with.90d
# 1: A 2017-01-01 FALSE
# 2: A 2017-02-01 TRUE
# 3: A 2017-05-01 TRUE
# 4: B 2017-01-01 FALSE
# 5: B 2017-05-01 FALSE
# 6: B 2017-10-01 FALSE
# 7: C 2017-01-01 FALSE
# 8: C 2017-02-01 TRUE
# 9: C 2017-02-15 TRUE
Explanation:

1. The difference in days is calculated with difftime(). The difference is between date and the shifted date (shift(date)), calculated by group (id).
2. FALSE is set for the first date per group, where the shifted difference is NA (is.na()).
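If preferred, the NA handling can be folded into the same assignment; a minimal sketch of this one-step variant (same logic as above, just treating the NA difference of the first row as FALSE directly):

# one-step variant: the NA difference in the first row of each group becomes FALSE
dt[, with.90d := {
  d <- as.numeric(difftime(date, shift(date), units = "days"))
  !is.na(d) & d < 90
}, by = id]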
Upvotes: 2