user2363777
user2363777

Reputation: 1071

How can I keep only those rows that together contain the longest consecutive run of a variable increasing by one, using dplyr in R?

I have a tibble where each row contains a subject identifier and a year. My goal is to isolate, for each subject, only those rows which together constitute the longest sequence of rows in which the variable year increases by 1 from one row to the next.

I've tried quite a few things with a grouped filter, such as building helper variables that code whether year on one row is one more or less than year on the previous row, and using the rle() function. But so far nothing has worked exactly as it should.

Here is a toy example of my data. Note that the number of rows varies between subjects and that there are typically (some) gaps between years. Also note that the data have been arranged so that the year value always increases from one row to the next within each subject.

# A tibble: 8 x 2
  subject  year
    <dbl> <dbl>
1       1  2012
2       1  2013
3       1  2015
4       1  2016
5       1  2017
6       1  2019
7       2  2011
8       2  2013

The toy example tibble can be recreated by running this code:

dat = structure(list(subject = c(1, 1, 1, 1, 1, 1, 2, 2), year = c(2012, 
2013, 2015, 2016, 2017, 2019, 2011, 2013)), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

To clarify, for this tibble the desired output is:

# A tibble: 3 x 2
  subject  year
    <dbl> <dbl>
1       1  2015
2       1  2016
3       1  2017

(Note that subject 2 is dropped because she has no sequence of years increasing by one.)

There must be an elegant way to do this using dplyr!

Upvotes: 2

Views: 208

Answers (1)

r2evans
r2evans

Reputation: 160437

This doesn't take into account ties, but ...

dat %>%
  group_by(subject) %>%
  mutate( r = cumsum(c(TRUE, diff(year) != 1)) ) %>%
  group_by(subject, r) %>%
  mutate( rcount = n() ) %>%
  group_by(subject) %>%
  filter(rcount > 1, rcount == max(rcount)) %>%
  select(-r, -rcount) %>%
  ungroup()
# # A tibble: 3 x 2
#   subject  year
#     <dbl> <dbl>
# 1       1  2015
# 2       1  2016
# 3       1  2017

Upvotes: 1

Related Questions