Reputation: 939
I have a dataframe like this:
ID | Item |
1 | A |
1 | B |
2 | A |
3 | B |
1 | C |
4 | C |
5 | B |
3 | A |
4 | A |
5 | D |
In R, how do I keep only the Top N values (by count) in Item Column? I would also like to fill the values which are not in Top N with "Others". Expected result if Top 2 values:
ID | Item |
1 | A |
1 | B |
2 | A |
3 | B |
1 | Others |
4 | Others |
5 | B |
3 | A |
4 | A |
5 | Others |
How do I handle tie cases too?
Upvotes: 1
Views: 284
Reputation: 123903
One way to achieve this is fct_lump
from forcats
package. Example code:
library(dplyr)
library(forcats)
set.seed(42)
df <- tibble(
id = sample(1:5, 10, replace = TRUE),
item = sample(LETTERS[1:4], 10, replace = TRUE))
df %>%
dplyr::mutate(item = forcats::fct_lump(item, 2))
Output:
# A tibble: 10 x 2
id item
<int> <fct>
1 1 A
2 5 A
3 1 B
4 1 Other
5 2 B
6 4 B
7 2 Other
8 2 Other
9 1 A
10 4 A
Upvotes: 2
Reputation: 9923
You can use forcats::fct_lump
library(tidyverse)
df <- read_delim("ID | Item
1 | A
1 | B
2 | A
3 | B
1 | C
4 | C
5 | B
3 | A
4 | A
5 | D", delim = "|", trim_ws = TRUE)
df %>%
mutate(Item = forcats::fct_lump(Item, n = 2))
#> # A tibble: 10 x 2
#> ID Item
#> <dbl> <fct>
#> 1 1 A
#> 2 1 B
#> 3 2 A
#> 4 3 B
#> 5 1 Other
#> 6 4 Other
#> 7 5 B
#> 8 3 A
#> 9 4 A
#> 10 5 Other
Created on 2020-03-03 by the reprex package (v0.3.0)
There is an argument ties.method
with options for dealing with ties
Upvotes: 2
Reputation: 51582
A base R solution,
replace(as.character(df$Item), !df$Item %in% names(table(df$Item)[1:2]), 'Others')
#[1] "A" "B" "A" "B" "Others" "Others" "B" "A" "A" "Others"
Upvotes: 1
Reputation: 39858
One dplyr
option could be:
df %>%
add_count(Item) %>%
mutate(Item = if_else(dense_rank(desc(n)) %in% 1:2, Item, "Others")) %>%
select(-n)
ID Item
<int> <chr>
1 1 A
2 1 B
3 2 A
4 3 B
5 1 Others
6 4 Others
7 5 B
8 3 A
9 4 A
10 5 Others
Upvotes: 1