Jill Walker Rettberg
Jill Walker Rettberg

Reputation: 11

How to select most common values in a column

I have a table of mostly categorical values and want to only keep rows that have the most common values in a particular column. I'm trying to use slice_max() but it's not working as I expect. I did see older suggestions for how to do this in base R or using the deprecated top_n(), but the top_n() documentation says to use slice_max instead and I can't find much detail about how slice_max works.

I'll use the starwars dataset as my example. The two most common homeworlds are Naboo, with 11 occurrences, and Tatooine, with 10. So I want the code to say "show me all the rows with the two most common homeworlds", and I expect that to give me a 21 row tibble where the homeworlds are all Naboo and Tatooine.

I added a column I called "worldcount" that simply counts the occurrences of the homeworld so I can easily see how many times each homeworld occurs. I also only selected a few columns to simplify things:

starwars %>%
  select(name, sex, homeworld, species) %>% 
  filter(!is.na(homeworld)) %>% 
  add_count(homeworld, name="worldcount") %>%
  slice_max(worldcount, n=2)

# A tibble: 11 × 5
   name          sex    homeworld species worldcount
   <chr>         <chr>  <chr>     <chr>        <int>
 1 R2-D2         none   Naboo     Droid           11
 2 Palpatine     male   Naboo     Human           11
 3 Jar Jar Binks male   Naboo     Gungan          11
 4 Roos Tarpals  male   Naboo     Gungan          11
 5 Rugor Nass    male   Naboo     Gungan          11
 6 Ric Olié      NA     Naboo     NA              11
 7 Quarsh Panaka NA     Naboo     NA              11
 8 Gregar Typho  male   Naboo     Human           11
 9 Cordé         female Naboo     Human           11
10 Dormé         female Naboo     Human           11
11 Padmé Amidala female Naboo     Human           11

But this code ONLY returns the rows where Naboo is the homeworld. When I set n=2 in slice_max() I expect the top 2 homeworlds - but Tatoine isn't here?

I also tried using slice_max() directly on a column with categorical data, but I think this might be calculating the "max" based on alphabetical order, since it returns the two homeworlds starting with letters at the end of the alphabet:

    starwars %>% 
      select(name, sex, homeworld, species) %>% 
      filter(!is.na(homeworld)) %>% 
      slice_max(homeworld, n=2)

 A tibble: 2 × 4
  name       sex    homeworld species   
  <chr>      <chr>  <chr>     <chr>     
1 Zam Wesell female Zolan     Clawdite  
2 Dud Bolt   male   Vulpter   Vulptereen

Finally I tried using slice_max on numeric data that's already in the starwars dataset, but this also doesn't work the way I would expect.

If I ask for the 8 top heights, I get what I expect: 9 rows, because two characters in starwars have the same height:

    starwars %>% 
      select(name, height) %>% 
      slice_max(height, n=8)

# A tibble: 9 × 2
  name         height
  <chr>         <int>
1 Yarael Poof     264
2 Tarfful         234
3 Lama Su         229
4 Chewbacca       228
5 Roos Tarpals    224
6 Grievous        216
7 Taun We         213
8 Rugor Nass      206
9 Tion Medon      206

So if I set n=9 and ask for the top 9 heights, I should get rows for 10 different characters, right? But no - this produces exactly the same result:

    starwars %>% 
      select(name, height) %>% 
      slice_max(height, n=9)

# A tibble: 9 × 2
  name         height
  <chr>         <int>
1 Yarael Poof     264
2 Tarfful         234
3 Lama Su         229
4 Chewbacca       228
5 Roos Tarpals    224
6 Grievous        216
7 Taun We         213
8 Rugor Nass      206
9 Tion Medon      206

So have I misunderstood how slice_max works?

Or is there a different way I can get to just the rows with the two most common homeworlds?

Upvotes: 1

Views: 2343

Answers (5)

jpdugo17
jpdugo17

Reputation: 7106

Another possible approach:

library(tidyverse)

sw <- starwars %>%
  select(name, sex, homeworld, species) %>%
  filter(!is.na(homeworld)) %>%
  add_count(homeworld)

counts <- unique(sw$n) %>% sort(decreasing = TRUE)
sw %>%
  filter(n %in% counts[1:2])
#> # A tibble: 21 × 5
#>    name               sex    homeworld species     n
#>    <chr>              <chr>  <chr>     <chr>   <int>
#>  1 Luke Skywalker     male   Tatooine  Human      10
#>  2 C-3PO              none   Tatooine  Droid      10
#>  3 R2-D2              none   Naboo     Droid      11
#>  4 Darth Vader        male   Tatooine  Human      10
#>  5 Owen Lars          male   Tatooine  Human      10
#>  6 Beru Whitesun lars female Tatooine  Human      10
#>  7 R5-D4              none   Tatooine  Droid      10
#>  8 Biggs Darklighter  male   Tatooine  Human      10
#>  9 Anakin Skywalker   male   Tatooine  Human      10
#> 10 Palpatine          male   Naboo     Human      11
#> # … with 11 more rows

Created on 2022-01-23 by the reprex package (v2.0.1)

Upvotes: 0

r2evans
r2evans

Reputation: 160407

slice_max is going to give you a maximum number of rows, not necessarily number of unique homeworlds. Try this:

out <- starwars %>%
  filter(
    homeworld %in% head(names(sort(table(homeworld), decreasing = TRUE)), 10)
  )
out
# # A tibble: 39 x 14
#    name               height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
#    <chr>               <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>   <list>   
#  1 Luke Skywalker        172    77 blond      fair       blue            19   male  mascu~ Tatooine  Human   <chr~ <chr [2~ <chr [2]>
#  2 C-3PO                 167    75 NA         gold       yellow         112   none  mascu~ Tatooine  Droid   <chr~ <chr [0~ <chr [0]>
#  3 R2-D2                  96    32 NA         white, bl~ red             33   none  mascu~ Naboo     Droid   <chr~ <chr [0~ <chr [0]>
#  4 Darth Vader           202   136 none       white      yellow          41.9 male  mascu~ Tatooine  Human   <chr~ <chr [0~ <chr [1]>
#  5 Leia Organa           150    49 brown      light      brown           19   fema~ femin~ Alderaan  Human   <chr~ <chr [1~ <chr [0]>
#  6 Owen Lars             178   120 brown, gr~ light      blue            52   male  mascu~ Tatooine  Human   <chr~ <chr [0~ <chr [0]>
#  7 Beru Whitesun lars    165    75 brown      light      blue            47   fema~ femin~ Tatooine  Human   <chr~ <chr [0~ <chr [0]>
#  8 R5-D4                  97    32 NA         white, red red             NA   none  mascu~ Tatooine  Droid   <chr~ <chr [0~ <chr [0]>
#  9 Biggs Darklighter     183    84 black      light      brown           24   male  mascu~ Tatooine  Human   <chr~ <chr [0~ <chr [1]>
# 10 Anakin Skywalker      188    84 blond      fair       blue            41.9 male  mascu~ Tatooine  Human   <chr~ <chr [2~ <chr [3]>
# # ... with 29 more rows
table(out$homeworld)
#    Alderaan Aleen Minor    Corellia   Coruscant      Kamino    Kashyyyk      Mirial       Naboo      Ryloth    Tatooine 
#           3           1           2           3           3           2           2          11           2          10 

base R

subset(starwars,
  homeworld %in% head(names(sort(table(homeworld), decreasing=TRUE)), 10))

Upvotes: 0

Jon Spring
Jon Spring

Reputation: 66415

starwars %>%
  count(homeworld, sort = TRUE) %>%
  slice(1:2) %>%
  left_join(starwars)

Result

Joining, by = "homeworld"
# A tibble: 21 x 15
   homeworld     n name         height  mass hair_color skin_color  eye_color birth_year sex    gender    species films    vehicles starships
   <chr>     <int> <chr>         <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr>  <chr>     <chr>   <list>   <list>   <list>   
 1 Naboo        11 R2-D2            96    32 NA         white, blue red               33 none   masculine Droid   <chr [7… <chr [0… <chr [0]>
 2 Naboo        11 Palpatine       170    75 grey       pale        yellow            82 male   masculine Human   <chr [5… <chr [0… <chr [0]>
 3 Naboo        11 Jar Jar Bin…    196    66 none       orange      orange            52 male   masculine Gungan  <chr [2… <chr [0… <chr [0]>
 4 Naboo        11 Roos Tarpals    224    82 none       grey        orange            NA male   masculine Gungan  <chr [1… <chr [0… <chr [0]>
 5 Naboo        11 Rugor Nass      206    NA none       green       orange            NA male   masculine Gungan  <chr [1… <chr [0… <chr [0]>
 6 Naboo        11 Ric Olié        183    NA brown      fair        blue              NA NA     NA        NA      <chr [1… <chr [0… <chr [1]>
 7 Naboo        11 Quarsh Pana…    183    NA black      dark        brown             62 NA     NA        NA      <chr [1… <chr [0… <chr [0]>
 8 Naboo        11 Gregar Typho    185    85 black      dark        brown             NA male   masculine Human   <chr [1… <chr [0… <chr [1]>
 9 Naboo        11 Cordé           157    NA brown      light       brown             NA female feminine  Human   <chr [1… <chr [0… <chr [0]>
10 Naboo        11 Dormé           165    NA brown      light       brown             NA female feminine  Human   <chr [1… <chr [0… <chr [0]>
# … with 11 more rows

Upvotes: 1

TarJae
TarJae

Reputation: 78917

Something like this?

starwars %>%
  select(name, sex, homeworld, species) %>% 
  filter(!is.na(homeworld)) %>% 
  count(homeworld, name="worldcount", sort = TRUE)  %>% 
  slice_max(n=2, order_by = worldcount, with_ties = FALSE)
  homeworld worldcount
  <chr>          <int>
1 Naboo             11
2 Tatooine          10

Upvotes: 1

Grzegorz Sapijaszko
Grzegorz Sapijaszko

Reputation: 3604

starwars %>%
  select(name, sex, homeworld, species) %>% 
  filter(!is.na(homeworld)) %>%
  group_by(homeworld) %>%
  count(name="world") %>% 
  arrange(desc(world)) %>%
  ungroup() %>%
  slice_max(world, n=2)

Upvotes: 0

Related Questions