Reputation: 11
I have a dataset with the columns ID, date and coverage. Each ID has a different number of dates, and coverage is an integer from 0 to 3. For each ID, I want to keep only the rows from the earliest date with coverage == 3 through the latest date with coverage == 3. Example input:
ID date coverage
001 2012-12-24 2
001 2013-12-04 3
001 2014-12-14 1
001 2015-12-02 3
001 2016-12-02 0
002 2012-01-15 3
002 2013-11-15 1
002 2014-11-15 3
003 2019-01-15 1
003 2020-11-15 1
003 2021-11-15 3
Example output:
ID date coverage
001 2013-12-04 3
001 2014-12-14 1
001 2015-12-02 3
002 2012-01-15 3
002 2013-11-15 1
002 2014-11-15 3
003 2021-11-15 3
Upvotes: 1
Views: 66
Reputation: 886938
We arrange by 'ID' and 'date', group by 'ID', and slice the rows from the first coverage value of 3 to the last. Note that if an ID has no 3 values in coverage, we need an if/else condition: either drop those IDs by having else return NULL, or, if we want to keep all rows for those IDs, have else return row_number().
library(dplyr)

df1 %>%
  arrange(ID, date) %>%
  group_by(ID) %>%
  slice(if (3 %in% coverage)
    match(3, coverage):last(which(coverage == 3)) else NULL) %>%
  # if we want to keep the full rows
  # slice(if (3 %in% coverage)
  #   match(3, coverage):last(which(coverage == 3)) else row_number()) %>%
  ungroup()
-output
# A tibble: 7 × 3
     ID date       coverage
  <int> <date>        <int>
1     1 2013-12-04        3
2     1 2014-12-14        1
3     1 2015-12-02        3
4     2 2012-01-15        3
5     2 2013-11-15        1
6     2 2014-11-15        3
7     3 2021-11-15        3
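To see why the if/else guard matters, here is a small base R sketch of the index logic on its own. The two coverage vectors are made up for illustration (the second one is a hypothetical ID that never reaches 3), and max(which(...)) stands in for last(which(...)) so that the snippet does not need dplyr.

```r
# Coverage for one ID that contains 3s (same values as ID 001 in the example)
cov_with3 <- c(2L, 3L, 1L, 3L, 0L)

# First position of a 3 and last position of a 3 give the row range to keep
rng <- match(3, cov_with3):max(which(cov_with3 == 3))

# Hypothetical coverage for an ID that never reaches 3
cov_no3 <- c(1L, 0L, 2L)

first3 <- match(3, cov_no3)      # NA, because there is no match
idx3   <- which(cov_no3 == 3)    # integer(0)

# Without a guard, building the range throws an error
unguarded <- tryCatch(first3:idx3[length(idx3)], error = function(e) "error")

# With the guard, we fall back to an empty index instead
guarded <- if (3 %in% cov_no3) first3:max(idx3) else integer(0)
```

Here rng covers positions 2 to 4, while the unguarded range for the second vector fails, which is what the if/else inside slice() protects against.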
-data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L), date = structure(c(15698, 16043, 16418, 16771, 17137, 15354,
16024, 16389, 17911, 18581, 18946), class = "Date"), coverage = c(2L,
3L, 1L, 3L, 0L, 3L, 1L, 3L, 1L, 1L, 3L)), row.names = c(NA, -11L
), class = "data.frame")
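As a possible alternative (a sketch, not the approach used above), the same range can be expressed with filter() and dplyr's cumany(), which avoids the explicit if/else: a group without any 3 yields all FALSE and is dropped. The data frame df2 below is an assumption for illustration. It repeats the example input and adds a hypothetical ID 4 that has no coverage of 3.

```r
library(dplyr)

# Example input plus a hypothetical ID 4 with no coverage of 3
df2 <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
  date = as.Date(c(
    "2012-12-24", "2013-12-04", "2014-12-14", "2015-12-02", "2016-12-02",
    "2012-01-15", "2013-11-15", "2014-11-15",
    "2019-01-15", "2020-11-15", "2021-11-15",
    "2020-01-01", "2021-01-01"
  )),
  coverage = c(2L, 3L, 1L, 3L, 0L, 3L, 1L, 3L, 1L, 1L, 3L, 1L, 2L)
)

res <- df2 %>%
  arrange(ID, date) %>%
  group_by(ID) %>%
  filter(
    cumany(coverage == 3),            # TRUE from the first 3 onwards
    rev(cumany(rev(coverage == 3)))   # TRUE up to and including the last 3
  ) %>%
  ungroup()
```

With this input, res keeps the same seven rows as the output above, and ID 4 is dropped because neither condition is ever TRUE for it.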
Upvotes: 0