Reputation: 11
I have a table of mostly categorical values and want to only keep rows that have the most common values in a particular column. I'm trying to use slice_max() but it's not working as I expect. I did see older suggestions for how to do this in base R or using the deprecated top_n(), but the top_n() documentation says to use slice_max instead and I can't find much detail about how slice_max works.
I'll use the starwars dataset as my example. The two most common homeworlds are Naboo, with 11 occurrences, and Tatooine, with 10. So I want the code to say "show me all the rows with the two most common homeworlds", and I expect that to give me a 21 row tibble where the homeworlds are all Naboo and Tatooine.
I added a column I called "worldcount" that simply counts the occurrences of the homeworld so I can easily see how many times each homeworld occurs. I also only selected a few columns to simplify things:
starwars %>%
select(name, sex, homeworld, species) %>%
filter(!is.na(homeworld)) %>%
add_count(homeworld, name="worldcount") %>%
slice_max(worldcount, n=2)
# A tibble: 11 × 5
name sex homeworld species worldcount
<chr> <chr> <chr> <chr> <int>
1 R2-D2 none Naboo Droid 11
2 Palpatine male Naboo Human 11
3 Jar Jar Binks male Naboo Gungan 11
4 Roos Tarpals male Naboo Gungan 11
5 Rugor Nass male Naboo Gungan 11
6 Ric Olié NA Naboo NA 11
7 Quarsh Panaka NA Naboo NA 11
8 Gregar Typho male Naboo Human 11
9 Cordé female Naboo Human 11
10 Dormé female Naboo Human 11
11 Padmé Amidala female Naboo Human 11
But this code ONLY returns the rows where Naboo is the homeworld. When I set n=2 in slice_max() I expect the top 2 homeworlds - but Tatoine isn't here?
I also tried using slice_max() directly on a column with categorical data, but I think this might be calculating the "max" based on alphabetical order, since it returns the two homeworlds starting with letters at the end of the alphabet:
starwars %>%
select(name, sex, homeworld, species) %>%
filter(!is.na(homeworld)) %>%
slice_max(homeworld, n=2)
A tibble: 2 × 4
name sex homeworld species
<chr> <chr> <chr> <chr>
1 Zam Wesell female Zolan Clawdite
2 Dud Bolt male Vulpter Vulptereen
Finally I tried using slice_max on numeric data that's already in the starwars dataset, but this also doesn't work the way I would expect.
If I ask for the 8 top heights, I get what I expect: 9 rows, because two characters in starwars have the same height:
starwars %>%
select(name, height) %>%
slice_max(height, n=8)
# A tibble: 9 × 2
name height
<chr> <int>
1 Yarael Poof 264
2 Tarfful 234
3 Lama Su 229
4 Chewbacca 228
5 Roos Tarpals 224
6 Grievous 216
7 Taun We 213
8 Rugor Nass 206
9 Tion Medon 206
So if I set n=9 and ask for the top 9 heights, I should get rows for 10 different characters, right? But no - this produces exactly the same result:
starwars %>%
select(name, height) %>%
slice_max(height, n=9)
# A tibble: 9 × 2
name height
<chr> <int>
1 Yarael Poof 264
2 Tarfful 234
3 Lama Su 229
4 Chewbacca 228
5 Roos Tarpals 224
6 Grievous 216
7 Taun We 213
8 Rugor Nass 206
9 Tion Medon 206
So have I misunderstood how slice_max works?
Or is there a different way I can get to just the rows with the two most common homeworlds?
Upvotes: 1
Views: 2343
Reputation: 7106
Another possible approach:
library(tidyverse)
sw <- starwars %>%
select(name, sex, homeworld, species) %>%
filter(!is.na(homeworld)) %>%
add_count(homeworld)
counts <- unique(sw$n) %>% sort(decreasing = TRUE)
sw %>%
filter(n %in% counts[1:2])
#> # A tibble: 21 × 5
#> name sex homeworld species n
#> <chr> <chr> <chr> <chr> <int>
#> 1 Luke Skywalker male Tatooine Human 10
#> 2 C-3PO none Tatooine Droid 10
#> 3 R2-D2 none Naboo Droid 11
#> 4 Darth Vader male Tatooine Human 10
#> 5 Owen Lars male Tatooine Human 10
#> 6 Beru Whitesun lars female Tatooine Human 10
#> 7 R5-D4 none Tatooine Droid 10
#> 8 Biggs Darklighter male Tatooine Human 10
#> 9 Anakin Skywalker male Tatooine Human 10
#> 10 Palpatine male Naboo Human 11
#> # … with 11 more rows
Created on 2022-01-23 by the reprex package (v2.0.1)
Upvotes: 0
Reputation: 160407
slice_max
is going to give you a maximum number of rows, not necessarily number of unique homeworld
s. Try this:
out <- starwars %>%
filter(
homeworld %in% head(names(sort(table(homeworld), decreasing = TRUE)), 10)
)
out
# # A tibble: 39 x 14
# name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
# 1 Luke Skywalker 172 77 blond fair blue 19 male mascu~ Tatooine Human <chr~ <chr [2~ <chr [2]>
# 2 C-3PO 167 75 NA gold yellow 112 none mascu~ Tatooine Droid <chr~ <chr [0~ <chr [0]>
# 3 R2-D2 96 32 NA white, bl~ red 33 none mascu~ Naboo Droid <chr~ <chr [0~ <chr [0]>
# 4 Darth Vader 202 136 none white yellow 41.9 male mascu~ Tatooine Human <chr~ <chr [0~ <chr [1]>
# 5 Leia Organa 150 49 brown light brown 19 fema~ femin~ Alderaan Human <chr~ <chr [1~ <chr [0]>
# 6 Owen Lars 178 120 brown, gr~ light blue 52 male mascu~ Tatooine Human <chr~ <chr [0~ <chr [0]>
# 7 Beru Whitesun lars 165 75 brown light blue 47 fema~ femin~ Tatooine Human <chr~ <chr [0~ <chr [0]>
# 8 R5-D4 97 32 NA white, red red NA none mascu~ Tatooine Droid <chr~ <chr [0~ <chr [0]>
# 9 Biggs Darklighter 183 84 black light brown 24 male mascu~ Tatooine Human <chr~ <chr [0~ <chr [1]>
# 10 Anakin Skywalker 188 84 blond fair blue 41.9 male mascu~ Tatooine Human <chr~ <chr [2~ <chr [3]>
# # ... with 29 more rows
table(out$homeworld)
# Alderaan Aleen Minor Corellia Coruscant Kamino Kashyyyk Mirial Naboo Ryloth Tatooine
# 3 1 2 3 3 2 2 11 2 10
subset(starwars,
homeworld %in% head(names(sort(table(homeworld), decreasing=TRUE)), 10))
Upvotes: 0
Reputation: 66415
starwars %>%
count(homeworld, sort = TRUE) %>%
slice(1:2) %>%
left_join(starwars)
Result
Joining, by = "homeworld"
# A tibble: 21 x 15
homeworld n name height mass hair_color skin_color eye_color birth_year sex gender species films vehicles starships
<chr> <int> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <list> <list> <list>
1 Naboo 11 R2-D2 96 32 NA white, blue red 33 none masculine Droid <chr [7… <chr [0… <chr [0]>
2 Naboo 11 Palpatine 170 75 grey pale yellow 82 male masculine Human <chr [5… <chr [0… <chr [0]>
3 Naboo 11 Jar Jar Bin… 196 66 none orange orange 52 male masculine Gungan <chr [2… <chr [0… <chr [0]>
4 Naboo 11 Roos Tarpals 224 82 none grey orange NA male masculine Gungan <chr [1… <chr [0… <chr [0]>
5 Naboo 11 Rugor Nass 206 NA none green orange NA male masculine Gungan <chr [1… <chr [0… <chr [0]>
6 Naboo 11 Ric Olié 183 NA brown fair blue NA NA NA NA <chr [1… <chr [0… <chr [1]>
7 Naboo 11 Quarsh Pana… 183 NA black dark brown 62 NA NA NA <chr [1… <chr [0… <chr [0]>
8 Naboo 11 Gregar Typho 185 85 black dark brown NA male masculine Human <chr [1… <chr [0… <chr [1]>
9 Naboo 11 Cordé 157 NA brown light brown NA female feminine Human <chr [1… <chr [0… <chr [0]>
10 Naboo 11 Dormé 165 NA brown light brown NA female feminine Human <chr [1… <chr [0… <chr [0]>
# … with 11 more rows
Upvotes: 1
Reputation: 78917
Something like this?
starwars %>%
select(name, sex, homeworld, species) %>%
filter(!is.na(homeworld)) %>%
count(homeworld, name="worldcount", sort = TRUE) %>%
slice_max(n=2, order_by = worldcount, with_ties = FALSE)
homeworld worldcount
<chr> <int>
1 Naboo 11
2 Tatooine 10
Upvotes: 1
Reputation: 3604
starwars %>%
select(name, sex, homeworld, species) %>%
filter(!is.na(homeworld)) %>%
group_by(homeworld) %>%
count(name="world") %>%
arrange(desc(world)) %>%
ungroup() %>%
slice_max(world, n=2)
Upvotes: 0