spidermarn
spidermarn

Reputation: 939

How do I keep only the Top n rows in a column

I have a dataframe like this:

ID | Item |
1  | A    |
1  | B    |
2  | A    |
3  | B    |
1  | C    |
4  | C    |
5  | B    |
3  | A    |
4  | A    |
5  | D    |

In R, how do I keep only the Top N values (by count) in Item Column? I would also like to fill the values which are not in Top N with "Others". Expected result if Top 2 values:

ID | Item   |
1  | A      |
1  | B      |
2  | A      |
3  | B      |
1  | Others |
4  | Others |
5  | B      |
3  | A      |
4  | A      |
5  | Others |

How do I handle tie cases too?

Upvotes: 1

Views: 284

Answers (4)

stefan
stefan

Reputation: 123903

One way to achieve this is fct_lump from forcats package. Example code:

library(dplyr)
library(forcats)
set.seed(42)

df <- tibble(
    id = sample(1:5, 10, replace = TRUE), 
    item = sample(LETTERS[1:4], 10, replace = TRUE))

df %>% 
    dplyr::mutate(item = forcats::fct_lump(item, 2))

Output:

# A tibble: 10 x 2
      id item 
   <int> <fct>
 1     1 A    
 2     5 A    
 3     1 B    
 4     1 Other
 5     2 B    
 6     4 B    
 7     2 Other
 8     2 Other
 9     1 A    
10     4 A  

Upvotes: 2

Richard Telford
Richard Telford

Reputation: 9923

You can use forcats::fct_lump

library(tidyverse)
df <- read_delim("ID | Item 
1  | A    
1  | B    
2  | A    
3  | B    
1  | C    
4  | C    
5  | B    
3  | A    
4  | A    
5  | D", delim = "|", trim_ws = TRUE)


df %>% 
  mutate(Item = forcats::fct_lump(Item, n = 2))
#> # A tibble: 10 x 2
#>       ID Item 
#>    <dbl> <fct>
#>  1     1 A    
#>  2     1 B    
#>  3     2 A    
#>  4     3 B    
#>  5     1 Other
#>  6     4 Other
#>  7     5 B    
#>  8     3 A    
#>  9     4 A    
#> 10     5 Other

Created on 2020-03-03 by the reprex package (v0.3.0)

There is an argument ties.method with options for dealing with ties

Upvotes: 2

Sotos
Sotos

Reputation: 51582

A base R solution,

replace(as.character(df$Item), !df$Item %in% names(table(df$Item)[1:2]), 'Others')

#[1] "A"      "B"      "A"      "B"      "Others" "Others" "B"      "A"      "A"      "Others"

Upvotes: 1

tmfmnk
tmfmnk

Reputation: 39858

One dplyr option could be:

df %>%
 add_count(Item) %>%
 mutate(Item = if_else(dense_rank(desc(n)) %in% 1:2, Item, "Others")) %>%
 select(-n)

      ID Item  
   <int> <chr> 
 1     1 A     
 2     1 B     
 3     2 A     
 4     3 B     
 5     1 Others
 6     4 Others
 7     5 B     
 8     3 A     
 9     4 A     
10     5 Others

Upvotes: 1

Related Questions