Reputation: 3973
I have the following data.table:
require(data.table)
dt = data.table(
id = c(rep('Grp 1', 31), rep('Grp 2', 31)),
date = rep(as.IDate(as.IDate('2020-01-01') : as.IDate('2020-01-31')), 2),
change = c(rep(NA, 5), rep('yes', 5), rep(NA, 10), rep('yes', 3), rep(NA, 8),
rep(NA, 2), rep('yes', 8), rep(NA, 8), rep('yes', 5), rep(NA, 8))
)
For every group id
I'd like to filter the first and last date
s of a series, which is defined by a second column change
being yes
(i.e. non-NA). I can do the following, which would provide me with the first and last non-NA row by group. However, the problem is that the series occurs more than once per group.
dt[ !is.na(change),
.(head(date, 1),
tail(date, 1)),
.(id) ]
These are the row indices I'd like to have filtered:
dt[c(6,10,21,23,34,41,50,54)]
Upvotes: 0
Views: 94
Reputation: 8844
One way is to give a unique group id to each streak identified by an id
and change
combination. We can use rleid
to generate such run-length type ids. Consider something like this
dt[,
gid := rleid(id, change)
][!is.na(change),
as.list(range(date)),
by = .(id, gid)
][,
gid := NULL
]
Note that I also assume that you want the range of dates, not really the first and last elements. Your method will fail if the dates are not in chronological order. Output looks like this
id V1 V2
1: Grp 1 2020-01-06 2020-01-10
2: Grp 1 2020-01-21 2020-01-23
3: Grp 2 2020-01-03 2020-01-10
4: Grp 2 2020-01-19 2020-01-23
rleid
works like this
> rleid(c(1, 1, 2, 3, 3), c("a", "b", "b", "d", "d"))
[1] 1 2 3 4 4
Upvotes: 1
Reputation: 886938
Here is an option with dplyr
library(dplyr)
library(data.table)
dt %>%
group_by(grp = rleid(id, change), id) %>%
filter(!is.na(change)) %>%
summarise(V1 = min(date, na.rm = TRUE),
V2 = max(date, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 4 x 4
# grp id V1 V2
# <int> <chr> <date> <date>
#1 2 Grp 1 2020-01-06 2020-01-10
#2 4 Grp 1 2020-01-21 2020-01-23
#3 7 Grp 2 2020-01-03 2020-01-10
#4 9 Grp 2 2020-01-19 2020-01-23
Upvotes: 1