Reputation: 147

Rename categorical levels to reduce the number of levels

I have a categorical column with around 1200 levels in a dataset of around 78000 records. I want to reduce the number of levels based on how often each level occurs. For example:

all levels occurring more than 2000 times be renamed to 'A'.
all levels occurring more than 1000 times but less than 2000 times be renamed to 'B'
all levels occurring more than 900 times but less than 1000 times be renamed to 'C'

And so on.

I don't want to lump the less frequently occurring levels into 'Others', as that would hide a lot of important levels.
Here is an example data frame:

df <- data.frame(
  ID = 1:10,
  Name = c("Jack", "Mike", "Jack", "Mike", "Jack", "Mike", "Tom", "Tom", "Smith", "Tony")
)

Here I would like to reduce the levels of the column 'Name' as follows:

Renaming all levels occurring >=3 times as 'A'
Renaming all levels occurring >=2 but <3 times as 'B'
Renaming all levels occurring <2 times as 'C'

Can anyone help me do this in R?

Upvotes: 3

Answers (3)

akrun

Reputation: 887951

We can use fcase from data.table (it was in the development version, 1.12.9, when this was written, and was released in 1.13.0), which also evaluates its arguments lazily:

library(data.table)
setDT(df)[, NewName := fcase(.N >=3, 'A',
                             .N >=2 & .N < 3, 'B',
                             default = 'C'), Name][]
#    ID  Name NewName
# 1:  1  Jack       A
# 2:  2  Mike       A
# 3:  3  Jack       A
# 4:  4  Mike       A
# 5:  5  Jack       A
# 6:  6  Mike       A
# 7:  7   Tom       B
# 8:  8   Tom       B
# 9:  9 Smith       C
#10: 10  Tony       C

Or, using base R with findInterval on the per-row counts:

with(df, rev(LETTERS[1:3])[findInterval(table(Name)[Name], 2:3) + 1])
#[1] "A" "A" "A" "A" "A" "A" "B" "B" "C" "C"
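
For the real data, where there are many more thresholds, the same findInterval idea could be wrapped in a small helper. This is only a sketch: the name bin_by_count and its arguments are made up for illustration, and the lower bounds are inclusive (count >= bound), so they would need adjusting if "more than 2000" is meant strictly.

```r
# Hypothetical helper (not part of any package): label each value by its frequency.
# `breaks`: ascending, inclusive lower bounds for the counts.
# `labels`: length(breaks) + 1 labels, ordered from the rarest bin to the most frequent.
bin_by_count <- function(x, breaks, labels) {
  stopifnot(length(labels) == length(breaks) + 1)
  x <- as.character(x)
  counts <- as.vector(table(x)[x])          # occurrence count for every row
  labels[findInterval(counts, breaks) + 1]  # bin index 0..length(breaks), shifted to 1-based
}

df <- data.frame(
  ID = 1:10,
  Name = c("Jack", "Mike", "Jack", "Mike", "Jack", "Mike", "Tom", "Tom", "Smith", "Tony")
)

# Toy thresholds from the question: >= 3 is 'A', 2 is 'B', 1 is 'C'
df$NewName <- bin_by_count(df$Name, breaks = c(2, 3), labels = c("C", "B", "A"))
```

On the full dataset the call might look like bin_by_count(x, breaks = c(900, 1000, 2000), labels = c("D", "C", "B", "A")), with the lowest label covering everything below the first bound.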

Upvotes: 0

G5W

Reputation: 37661

A base R solution using table

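# Look up, for every row, how often its Name occurs
NameCount = table(df$Name)[df$Name]
# Start with the lowest label, then overwrite from the smallest threshold up
NewName = rep("C", length(NameCount))
NewName[NameCount >= 2] = "B"
NewName[NameCount >= 3] = "A"
NewName
# [1] "A" "A" "A" "A" "A" "A" "B" "B" "C" "C"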
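
If the result should be a factor with the reduced set of levels rather than a character vector, one possible variant is to bin the per-row counts with cut. This is a sketch under the toy thresholds from the question; the variable name n_per_row and the break points are assumptions chosen for this example.

```r
df <- data.frame(
  ID = 1:10,
  Name = c("Jack", "Mike", "Jack", "Mike", "Jack", "Mike", "Tom", "Tom", "Smith", "Tony")
)

# Occurrence count of each row's Name (ave applies length within each Name group)
n_per_row <- ave(seq_along(df$Name), df$Name, FUN = length)

# cut uses right-closed intervals by default: (-Inf, 1] -> C, (1, 2] -> B, (2, Inf] -> A
df$NewName <- cut(n_per_row, breaks = c(-Inf, 1, 2, Inf), labels = c("C", "B", "A"))

nlevels(df$NewName)  # number of levels after recoding
```

Because the counts are integers, the right-closed intervals give the same grouping as the >= comparisons above.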

Upvotes: 2

Ronak Shah

Reputation: 389325

We can use add_count to count the occurrences of each Name and then assign the new labels with case_when:

library(dplyr)

df %>% 
  add_count(Name) %>%
  mutate(NewName = case_when(n >= 3 ~'A', 
                             n >= 2 & n < 3 ~'B', 
                             TRUE ~'C')) %>%
  select(-n, -Name)

#     ID NewName
#   <int> <chr>  
# 1     1 A      
# 2     2 A      
# 3     3 A      
# 4     4 A      
# 5     5 A      
# 6     6 A      
# 7     7 B      
# 8     8 B      
# 9     9 C      
#10    10 C

Upvotes: 3
