Syzorr
Syzorr

Reputation: 607

Filtering Using the Results of Grouping

Within my full data set, there are repeated surveys in the same areas. My problem is that I'm trying to select those for which they have been surveyed in each year that surveys were performed. One other situation involves those that have been surveyed at least 3 out of the 4 times.

I've produced the following example data set in order to highlight what I'm trying to do:

id   area_id   year
 1         1   2010
 2         1   2011
 3         1   2012
 4         1   2013
 5         2   2010
 6         2   2011
 7         2   2013
 8         3   2010
 9         3   2012
10         4   2012

With the following code to easily create it:

ex_df <- data.frame(id = 1:10,
                    area_id = c(rep(1, 4), rep(2, 3), rep(3, 2), 4),
                    year = c(2010:2013, 2010, 2011, 2013, 2010, 2012, 2012))

My simplistic understanding of dplyr has me doing

ex_df %>% group_by(area_id, year) %>% ???

so far but the ??? represents where I'm lost as to how to filter this to return a vector of the IDs relating to my requirements.

For the first problem, I would want to filter to give the vector (1, 2, 3, 4) as area_id 1 was the only one surveyed in all 4 years.

For the second problem, I would want the vector to be (1, 2, 3, 4, 5, 6, 7).

What would be the appropriate way to manage this?

Upvotes: 1

Views: 35

Answers (2)

r2evans
r2evans

Reputation: 160407

How about this?

required_years <- 2010:2013

ex_df2 <- ex_df %>%
  group_by(area_id) %>%
  mutate(
    fouryears = all(required_years %in% year),
    threeyears = sum(required_years %in% year) >= 3
  )
ex_df2
# Source: local data frame [10 x 5]
# Groups: area_id [4]
#       id area_id  year fouryears threeyears
#    <int>   <dbl> <dbl>     <lgl>      <lgl>
# 1      1       1  2010      TRUE       TRUE
# 2      2       1  2011      TRUE       TRUE
# 3      3       1  2012      TRUE       TRUE
# 4      4       1  2013      TRUE       TRUE
# 5      5       2  2010     FALSE       TRUE
# 6      6       2  2011     FALSE       TRUE
# 7      7       2  2013     FALSE       TRUE
# 8      8       3  2010     FALSE      FALSE
# 9      9       3  2012     FALSE      FALSE
# 10    10       4  2012     FALSE      FALSE

(where you can pull to grab what you need)

filter(ex_df2, fouryears) %>% pull(id)
# [1] 1 2 3 4
filter(ex_df2, threeyears) %>% pull(id)
# [1] 1 2 3 4 5 6 7

Upvotes: 1

Maurits Evers
Maurits Evers

Reputation: 50668

For case 1:

ex_df %>%
    group_by(area_id) %>%
    filter(n() == 4) %>%
    pull(id)
#[1] 1 2 3 4    

For case 2:

ex_df %>%
    group_by(area_id) %>%
    filter(n() >= 3) %>%
    pull(id)
#[1] 1 2 3 4 5 6 7

Upvotes: 2

Related Questions