N08
N08

Reputation: 1315

Finding rows that follow each other consecutively based on year-column in dataframe

I have a dataframe containing three columns, where the first is an ID, the second denotes a year and the third column is the value associated with the ID in that year:

df.in <- data.frame("id"=c(1,   1,   1,   1,   1,   1,   1,   1,   2,   2,   2,   2,   2,   2,   2,   2,   2,   2),
                    "yr"=c(2005,2006,2007,2009,2010,2011,2012,2013,2006,2007,2008,2009,2010,2011,2012,2013,2015,2016),
                    "vl"=c(5,   6,   7,   8,   10,  1,   2,   3,   6,   8,   10,  1,   2,   3,   4,   5,   7,   9))

As you can see, there are some gaps between the years for a given ID. For ID==1 there are two "groups" of consecutive years having 3 and 5 rows, respectively. For ID==2 it is 8 and 2, respectively.

Is there a dplyr-way to produce a data.frame that contains only the group that has the largest number of rows per ID that follow each other consecutively? In other words I would like a dataframe that has 5+8 rows:

df.in <- data.frame("id"=c(1,   1,   1,   1,   1,   2,   2,   2,   2,   2,   2,   2,   2),
                    "yr"=c(2009,2010,2011,2012,2013,2006,2007,2008,2009,2010,2011,2012,2013),
                    "vl"=c(8,   10,  1,   2,   3,   6,   8,   10,  1,   2,   3,   4,   5))

EDIT: Just one more test-case:

df.in <- data.frame("id"=c(1,   1,   1,   1,   1,   1,   1,   1),
                    "yr"=c(2005,2006,2007,2008,2009,2010,2011,2012),
                    "vl"=c(5,   6,   7,   8,   10,  1,   2,   3))

should give an output identical to the input.

Upvotes: 1

Views: 61

Answers (3)

www
www

Reputation: 39154

Similar solution but with the use of complete and rleid function from tidyr and data.table package.

library(dplyr)
library(tidyr)
library(data.table)

df.out <- df.in %>%
  group_by(id) %>%
  complete(yr = full_seq(yr, period = 1)) %>%
  mutate(Group = rleid(is.na(vl))) %>%
  group_by(id, Group) %>%
  mutate(N = n()) %>%
  ungroup() %>%
  group_by(id) %>%
  filter(N == max(N)) %>%
  select(-Group, -N) %>%
  ungroup()

df.out
# # A tibble: 13 x 3
#      id    yr    vl
#    <dbl> <dbl> <dbl>
#  1  1.00  2009  8.00
#  2  1.00  2010 10.0 
#  3  1.00  2011  1.00
#  4  1.00  2012  2.00
#  5  1.00  2013  3.00
#  6  2.00  2006  6.00
#  7  2.00  2007  8.00
#  8  2.00  2008 10.0 
#  9  2.00  2009  1.00
# 10  2.00  2010  2.00
# 11  2.00  2011  3.00
# 12  2.00  2012  4.00
# 13  2.00  2013  5.00

Upvotes: 1

twedl
twedl

Reputation: 1648

dplyr answer

See where there's a gap in years (x), then use cumsum to group them, then count and filter for the maximum:

df.in %>%
  group_by(id) %>% 
  mutate(x = ifelse(is.na(lag(yr)), F, lag(yr) != yr - 1)) %>%
  mutate(y = cumsum(x)) %>%
  group_by(id, y) %>% add_count() %>%
  group_by(id) %>% filter(n == max(n)) %>%
  select(-x, -y, -n)

Upvotes: 1

moodymudskipper
moodymudskipper

Reputation: 47320

We add a flag when the date group is changing (after grouping by id), the cumsum of this flag gives us a group id.

We count the rows by these groups and keep only the max for each of your id value :

df.in %>%
  group_by(id) %>%
  mutate(group = cumsum(c(1,yr[1:(length(yr)-1)] < yr[2:length(yr)]-1))) %>%
  group_by(id,group) %>%
  add_count %>%
  group_by(id) %>%
  filter(n == max(n)) %>%
  select(-n) %>%
  ungroup


# # A tibble: 13 x 4
#          id    yr    vl group
#       <dbl> <dbl> <dbl> <dbl>
#     1     1  2009     8     2
#     2     1  2010    10     2
#     3     1  2011     1     2
#     4     1  2012     2     2
#     5     1  2013     3     2
#     6     2  2006     6     1
#     7     2  2007     8     1
#     8     2  2008    10     1
#     9     2  2009     1     1
#    10     2  2010     2     1
#    11     2  2011     3     1
#    12     2  2012     4     1
#    13     2  2013     5     1

Upvotes: 3

Related Questions