Reputation: 13
I have a survey of households in various cities in every state. Some cities only have a few respondents, so I want to drop the cities with 5 or fewer. I tried the code below, but there are some cities with the same name in multiple states (Paris, Idaho has only 2 respondents but Paris, Texas has 13).
How can I filter out Paris, Idaho but not Paris, Texas?
city_tally <- scores %>%
group_by(state, city) %>%
tally()
enough_samples <- city_tally %>%
filter(n>5) %>%
select(state, city, n)
scores <- scores %>%
group_by(state) %>%
filter(city %in% enough_samples$city)
Upvotes: 1
Views: 110
Reputation: 886948
One option where we can do this in a chain would be - after grouping by 'state', 'city', create the frequency column ('n') with mutate
, then by grouping with 'state' do the filter
based on the 'n'
library(dplyr)
scores %>%
group_by(state, city) %>%
mutate(n = n()) %>%
group_by(state) %>%
filter(n > 5) %>%
select(-n) # if it is not required to have the 'n' column
-output (based on @Brandon's reproducible example
# A tibble: 13 x 3
# Groups: state [2]
# city state scores
# <fctr> <fctr> <dbl>
# 1 Paris Texas 4.73
# 2 Paris Texas 0.657
# 3 Paris Texas 5.32
# 4 Paris Texas 0.718
# 5 Paris Texas 6.95
# 6 Paris Texas 6.30
# 7 Yew Maryland -3.96
# 8 Yew Maryland 6.48
# 9 Yew Maryland 3.78
#10 Yew Maryland 3.38
#11 Yew Maryland -1.88
#12 Yew Maryland 2.09
#13 Yew Maryland 5.67
Upvotes: 0
Reputation: 1904
Instead of the last group 3 lines you have scores <- scores %>% etc...
, try this below:
left_join(scores, enough_samples, by = c('state', 'city')) %>%
filter(!is.na(n))
This uses left_join
and you can merge using multiple identifiers, in this case both state and city.
For a demonstration of how this works, for others who may come, see the example code and output below. First I make some fake data.
> set.seed(1)
> scores <- data.frame(city = rep(c('Paris','London','York','Los','Yew'), c(10, 5, 5, 3, 7)),
state = c(rep(c('Idaho','Texas','Texas'), 4), rep('Cali', 5), rep('Oregon', 4), rep('Maryland', 9)),
scores = rnorm(n = 30, mean = 4, sd = 4))
> scores
city state scores
1 Paris Idaho 1.4941848
2 Paris Texas 4.7345733
3 Paris Texas 0.6574856
4 Paris Idaho 10.3811232
5 Paris Texas 5.3180311
6 Paris Texas 0.7181265
7 Paris Idaho 5.9497162
8 Paris Texas 6.9532988
9 Paris Texas 6.3031254
10 Paris Idaho 2.7784465
11 London Texas 10.0471247
12 London Texas 5.5593729
13 London Cali 1.5150377
14 London Cali -4.8587995
15 London Cali 8.4997237
16 York Cali 3.8202656
17 York Cali 3.9352389
18 York Oregon 7.7753448
19 York Oregon 7.2848848
20 York Oregon 6.3756053
21 Los Oregon 7.6759095
22 Los Maryland 7.1285452
23 Los Maryland 4.2982599
24 Yew Maryland -3.9574068
25 Yew Maryland 6.4793030
26 Yew Maryland 3.7754850
27 Yew Maryland 3.3768180
28 Yew Maryland -1.8830095
29 Yew Maryland 2.0873998
30 Yew Maryland 5.6717662
Next we use the code you provided to create city_tally
and enough_samples
> city_tally <- scores %>%
group_by(state, city) %>%
tally()
> city_tally
# A tibble: 9 x 3
# Groups: state [5]
state city n
<fct> <fct> <int>
1 Cali London 3
2 Cali York 2
3 Idaho Paris 4
4 Maryland Los 2
5 Maryland Yew 7
6 Oregon Los 1
7 Oregon York 3
8 Texas London 2
9 Texas Paris 6
> enough_samples <- city_tally %>%
filter(n>5) %>%
select(state, city, n)
> enough_samples
# A tibble: 2 x 3
# Groups: state [2]
state city n
<fct> <fct> <int>
1 Maryland Yew 7
2 Texas Paris 6
Finally we use left_join
to merge these values back into the original scores table. These can then be used to filter the scores tables.
> left_join(scores, enough_samples, by = c('state', 'city')) %>%
filter(!is.na(n))
city state scores n
1 Paris Texas 4.7345733 6
2 Paris Texas 0.6574856 6
3 Paris Texas 5.3180311 6
4 Paris Texas 0.7181265 6
5 Paris Texas 6.9532988 6
6 Paris Texas 6.3031254 6
7 Yew Maryland -3.9574068 7
8 Yew Maryland 6.4793030 7
9 Yew Maryland 3.7754850 7
10 Yew Maryland 3.3768180 7
11 Yew Maryland -1.8830095 7
12 Yew Maryland 2.0873998 7
13 Yew Maryland 5.6717662 7
Upvotes: 2