bm4n4d
bm4n4d

Reputation: 13

how to filter by a combination of two conditions

I have a survey of households in various cities in every state. Some cities only have a few respondents, so I want to drop the cities with 5 or fewer. I tried the code below, but there are some cities with the same name in multiple states (Paris, Idaho has only 2 respondents but Paris, Texas has 13).

How can I filter out Paris, Idaho but not Paris, Texas?

city_tally <- scores %>%
group_by(state, city) %>%
tally()

enough_samples <- city_tally %>%
filter(n>5) %>%
select(state, city, n)

scores <- scores %>%
group_by(state) %>%
filter(city %in% enough_samples$city)

Upvotes: 1

Views: 110

Answers (2)

akrun
akrun

Reputation: 886948

One option where we can do this in a chain would be - after grouping by 'state', 'city', create the frequency column ('n') with mutate, then by grouping with 'state' do the filter based on the 'n'

library(dplyr)
scores %>% 
   group_by(state, city) %>%
   mutate(n = n()) %>%
   group_by(state) %>% 
   filter(n > 5) %>%
   select(-n) # if it is not required to have the 'n' column

-output (based on @Brandon's reproducible example

# A tibble: 13 x 3
# Groups: state [2]
#   city   state    scores
#   <fctr> <fctr>    <dbl>
# 1 Paris  Texas     4.73 
# 2 Paris  Texas     0.657
# 3 Paris  Texas     5.32 
# 4 Paris  Texas     0.718
# 5 Paris  Texas     6.95 
# 6 Paris  Texas     6.30 
# 7 Yew    Maryland -3.96 
# 8 Yew    Maryland  6.48 
# 9 Yew    Maryland  3.78 
#10 Yew    Maryland  3.38 
#11 Yew    Maryland -1.88 
#12 Yew    Maryland  2.09 
#13 Yew    Maryland  5.67 

Upvotes: 0

Brandon
Brandon

Reputation: 1904

Instead of the last group 3 lines you have scores <- scores %>% etc..., try this below:

left_join(scores, enough_samples, by = c('state', 'city')) %>% 
  filter(!is.na(n))

This uses left_join and you can merge using multiple identifiers, in this case both state and city.

For a demonstration of how this works, for others who may come, see the example code and output below. First I make some fake data.

> set.seed(1)
> scores <- data.frame(city = rep(c('Paris','London','York','Los','Yew'), c(10, 5, 5, 3, 7)),
                      state = c(rep(c('Idaho','Texas','Texas'), 4), rep('Cali', 5), rep('Oregon', 4), rep('Maryland', 9)),
                      scores = rnorm(n = 30, mean = 4, sd = 4))

> scores
     city    state     scores
1   Paris    Idaho  1.4941848
2   Paris    Texas  4.7345733
3   Paris    Texas  0.6574856
4   Paris    Idaho 10.3811232
5   Paris    Texas  5.3180311
6   Paris    Texas  0.7181265
7   Paris    Idaho  5.9497162
8   Paris    Texas  6.9532988
9   Paris    Texas  6.3031254
10  Paris    Idaho  2.7784465
11 London    Texas 10.0471247
12 London    Texas  5.5593729
13 London     Cali  1.5150377
14 London     Cali -4.8587995
15 London     Cali  8.4997237
16   York     Cali  3.8202656
17   York     Cali  3.9352389
18   York   Oregon  7.7753448
19   York   Oregon  7.2848848
20   York   Oregon  6.3756053
21    Los   Oregon  7.6759095
22    Los Maryland  7.1285452
23    Los Maryland  4.2982599
24    Yew Maryland -3.9574068
25    Yew Maryland  6.4793030
26    Yew Maryland  3.7754850
27    Yew Maryland  3.3768180
28    Yew Maryland -1.8830095
29    Yew Maryland  2.0873998
30    Yew Maryland  5.6717662

Next we use the code you provided to create city_tally and enough_samples

> city_tally <- scores %>%
   group_by(state, city) %>%
   tally()

> city_tally
# A tibble: 9 x 3
# Groups:   state [5]
  state    city       n
  <fct>    <fct>  <int>
1 Cali     London     3
2 Cali     York       2
3 Idaho    Paris      4
4 Maryland Los        2
5 Maryland Yew        7
6 Oregon   Los        1
7 Oregon   York       3
8 Texas    London     2
9 Texas    Paris      6

> enough_samples <- city_tally %>%
   filter(n>5) %>%
   select(state, city, n)

> enough_samples
# A tibble: 2 x 3
# Groups:   state [2]
  state    city      n
  <fct>    <fct> <int>
1 Maryland Yew       7
2 Texas    Paris     6

Finally we use left_join to merge these values back into the original scores table. These can then be used to filter the scores tables.

> left_join(scores, enough_samples, by = c('state', 'city')) %>% 
  filter(!is.na(n))
    city    state     scores n
1  Paris    Texas  4.7345733 6
2  Paris    Texas  0.6574856 6
3  Paris    Texas  5.3180311 6
4  Paris    Texas  0.7181265 6
5  Paris    Texas  6.9532988 6
6  Paris    Texas  6.3031254 6
7    Yew Maryland -3.9574068 7
8    Yew Maryland  6.4793030 7
9    Yew Maryland  3.7754850 7
10   Yew Maryland  3.3768180 7
11   Yew Maryland -1.8830095 7
12   Yew Maryland  2.0873998 7
13   Yew Maryland  5.6717662 7

Upvotes: 2

Related Questions