Reputation: 1315
I have a dataframe with three columns: the first is an ID, the second a year, and the third the value associated with that ID in that year:
df.in <- data.frame("id" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                    "yr" = c(2005, 2006, 2007, 2009, 2010, 2011, 2012, 2013, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2015, 2016),
                    "vl" = c(5, 6, 7, 8, 10, 1, 2, 3, 6, 8, 10, 1, 2, 3, 4, 5, 7, 9))
As you can see, there are some gaps between the years for a given ID. For ID==1
there are two "groups" of consecutive years having 3 and 5 rows, respectively. For ID==2
it is 8 and 2, respectively.
Is there a dplyr way to produce a data.frame that, for each ID, contains only the group with the largest number of consecutive years? In other words, I would like a dataframe with 5 + 8 rows:
df.in <- data.frame("id" = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                    "yr" = c(2009, 2010, 2011, 2012, 2013, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013),
                    "vl" = c(8, 10, 1, 2, 3, 6, 8, 10, 1, 2, 3, 4, 5))
EDIT: Just one more test-case:
df.in <- data.frame("id" = c(1, 1, 1, 1, 1, 1, 1, 1),
                    "yr" = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012),
                    "vl" = c(5, 6, 7, 8, 10, 1, 2, 3))
should give an output identical to the input.
Upvotes: 1
Views: 61
Reputation: 39154
A similar solution, but using the complete function from the tidyr package and the rleid function from the data.table package.
library(dplyr)
library(tidyr)
library(data.table)

df.out <- df.in %>%
  group_by(id) %>%
  # fill in the missing years so that gaps show up as NA in vl
  complete(yr = full_seq(yr, period = 1)) %>%
  # consecutive runs of non-missing (and missing) vl get the same run id
  mutate(Group = rleid(is.na(vl))) %>%
  group_by(id, Group) %>%
  mutate(N = n()) %>%
  ungroup() %>%
  group_by(id) %>%
  # keep only the largest run per id and drop the helper columns
  filter(N == max(N)) %>%
  select(-Group, -N) %>%
  ungroup()

df.out
# # A tibble: 13 x 3
#       id    yr    vl
#    <dbl> <dbl> <dbl>
#  1  1.00  2009  8.00
#  2  1.00  2010 10.0
#  3  1.00  2011  1.00
#  4  1.00  2012  2.00
#  5  1.00  2013  3.00
#  6  2.00  2006  6.00
#  7  2.00  2007  8.00
#  8  2.00  2008 10.0
#  9  2.00  2009  1.00
# 10  2.00  2010  2.00
# 11  2.00  2011  3.00
# 12  2.00  2012  4.00
# 13  2.00  2013  5.00
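If it helps to see what the two helpers do on their own, here is a minimal sketch on made-up vectors (not part of the original answer): full_seq() fills in every year between the minimum and maximum, and rleid() gives each run of identical values its own integer id, so runs of missing and non-missing vl become separate groups.
library(tidyr)
library(data.table)

# full_seq() expands a numeric vector to every value between its min and max
full_seq(c(2005, 2006, 2007, 2009), period = 1)
#> [1] 2005 2006 2007 2008 2009

# rleid() assigns a new integer id every time the value changes
rleid(c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE))
#> [1] 1 1 1 2 3 3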
Upvotes: 1
Reputation: 1648
See where there's a gap in years (x), then use cumsum to group them, then count and filter for the maximum:
library(dplyr)

df.in %>%
  group_by(id) %>%
  # x is TRUE whenever the current year does not directly follow the previous one
  mutate(x = ifelse(is.na(lag(yr)), FALSE, lag(yr) != yr - 1)) %>%
  # cumsum(x) increments at every gap, giving a run id y
  mutate(y = cumsum(x)) %>%
  group_by(id, y) %>% add_count() %>%
  group_by(id) %>% filter(n == max(n)) %>%
  select(-x, -y, -n)
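To make the intermediate flags visible, here is a sketch of the same logic applied to the years of ID 1 as a plain vector (the standalone vector and printed results are for illustration only):
library(dplyr)

yr <- c(2005, 2006, 2007, 2009, 2010, 2011, 2012, 2013)

# TRUE where the previous year is not exactly one less than the current year
x <- ifelse(is.na(lag(yr)), FALSE, lag(yr) != yr - 1)
x
#> [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

# cumsum turns the flags into run ids: the 2009-2013 run gets id 1
cumsum(x)
#> [1] 0 0 0 1 1 1 1 1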
Upvotes: 1
Reputation: 47320
We add a flag wherever the sequence of years breaks (after grouping by id); the cumsum of this flag gives us a group id. We then count the rows in each of these groups and keep only the largest group for each id value:
library(dplyr)

df.in %>%
  group_by(id) %>%
  # the flag is 1 for the first row of each run of consecutive years,
  # so its cumulative sum numbers the runs within each id
  mutate(group = cumsum(c(1, yr[1:(length(yr) - 1)] < yr[2:length(yr)] - 1))) %>%
  group_by(id, group) %>%
  add_count %>%
  group_by(id) %>%
  filter(n == max(n)) %>%
  select(-n) %>%
  ungroup
# # A tibble: 13 x 4
#       id    yr    vl group
#    <dbl> <dbl> <dbl> <dbl>
#  1     1  2009     8     2
#  2     1  2010    10     2
#  3     1  2011     1     2
#  4     1  2012     2     2
#  5     1  2013     3     2
#  6     2  2006     6     1
#  7     2  2007     8     1
#  8     2  2008    10     1
#  9     2  2009     1     1
# 10     2  2010     2     1
# 11     2  2011     3     1
# 12     2  2012     4     1
# 13     2  2013     5     1
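As a quick check against the extra test case from the question's edit (one unbroken run of years), the same pipeline should return the input unchanged, since every row falls into a single group. A sketch, using a hypothetical name df.check for that test case:
library(dplyr)

df.check <- data.frame("id" = c(1, 1, 1, 1, 1, 1, 1, 1),
                       "yr" = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012),
                       "vl" = c(5, 6, 7, 8, 10, 1, 2, 3))

df.check %>%
  group_by(id) %>%
  mutate(group = cumsum(c(1, yr[1:(length(yr) - 1)] < yr[2:length(yr)] - 1))) %>%
  group_by(id, group) %>%
  add_count %>%
  group_by(id) %>%
  filter(n == max(n)) %>%
  select(-n) %>%
  ungroup
# all 8 rows are kept, with group equal to 1 throughout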
Upvotes: 3