Reputation: 607
Within my full data set, there are repeated surveys in the same areas. My problem is that I'm trying to select those for which they have been surveyed in each year that surveys were performed. One other situation involves those that have been surveyed at least 3 out of the 4 times.
I've produced the following example data set in order to highlight what I'm trying to do:
id area_id year
1 1 2010
2 1 2011
3 1 2012
4 1 2013
5 2 2010
6 2 2011
7 2 2013
8 3 2010
9 3 2012
10 4 2012
With the following code to easily create it:
ex_df <- data.frame(id = 1:10,
area_id = c(rep(1, 4), rep(2, 3), rep(3, 2), 4),
year = c(2010:2013, 2010, 2011, 2013, 2010, 2012, 2012))
My simplistic understanding of dplyr
has me doing
ex_df %>% group_by(area_id, year) %>% ???
so far but the ??? represents where I'm lost as to how to filter this to return a vector of the IDs relating to my requirements.
For the first problem, I would want to filter to give the vector (1, 2, 3, 4)
as area_id 1
was the only one surveyed in all 4 years.
For the second problem, I would want the vector to be (1, 2, 3, 4, 5, 6, 7)
.
What would be the appropriate way to manage this?
Upvotes: 1
Views: 35
Reputation: 160407
How about this?
required_years <- 2010:2013
ex_df2 <- ex_df %>%
group_by(area_id) %>%
mutate(
fouryears = all(required_years %in% year),
threeyears = sum(required_years %in% year) >= 3
)
ex_df2
# Source: local data frame [10 x 5]
# Groups: area_id [4]
# id area_id year fouryears threeyears
# <int> <dbl> <dbl> <lgl> <lgl>
# 1 1 1 2010 TRUE TRUE
# 2 2 1 2011 TRUE TRUE
# 3 3 1 2012 TRUE TRUE
# 4 4 1 2013 TRUE TRUE
# 5 5 2 2010 FALSE TRUE
# 6 6 2 2011 FALSE TRUE
# 7 7 2 2013 FALSE TRUE
# 8 8 3 2010 FALSE FALSE
# 9 9 3 2012 FALSE FALSE
# 10 10 4 2012 FALSE FALSE
(where you can pull
to grab what you need)
filter(ex_df2, fouryears) %>% pull(id)
# [1] 1 2 3 4
filter(ex_df2, threeyears) %>% pull(id)
# [1] 1 2 3 4 5 6 7
Upvotes: 1
Reputation: 50668
For case 1:
ex_df %>%
group_by(area_id) %>%
filter(n() == 4) %>%
pull(id)
#[1] 1 2 3 4
For case 2:
ex_df %>%
group_by(area_id) %>%
filter(n() >= 3) %>%
pull(id)
#[1] 1 2 3 4 5 6 7
Upvotes: 2