A_K

Reputation: 1

Categorize large factor into small factors based on frequency with remaining entries as 'Others'

I have a large factor (df$name) with more than 1000 levels. What I need is the top 10-15 levels by frequency, with the remaining levels grouped together as 'Others'.

I tried using the following command but wasn't successful: df$name <- levels(df$name)[which(table(df$name)<1000000)] <- "Others"

PS: I'm using a frequency threshold because I don't want to restrict myself to a specific number of levels. I'm happy with anywhere from 5 to 20 top levels (by frequency), with the rest combined as 'Others' for easy visualization.

Upvotes: 0

Views: 47

Answers (2)

Jon Spring

Reputation: 66870

Here's a column in a data frame with 2000 levels:

df <- data.frame(names = sample(1:2000, 1E6, replace = TRUE))
df$names <- as.factor(df$names)

And here a new variable is added that keeps the 15 most frequent levels and puts the rest in "Other":

df$names_lump = forcats::fct_lump(df$names, n = 15)
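If a count threshold is preferred over a fixed number of levels, as the question's PS suggests, the same idea can be sketched in base R by renaming every level whose count falls below a cutoff. The toy data, the `min_count` value, and the `names_lump` column below are assumptions for illustration only; recent forcats versions also provide `fct_lump_min()` for this purpose.

```r
# Toy data (assumed): skewed frequencies so that a handful of levels dominate.
set.seed(42)
df <- data.frame(
  names = factor(sample(1:2000, 1e5, replace = TRUE, prob = 1 / (1:2000)))
)

min_count <- 1000  # hypothetical cutoff; tune it until 5-20 levels remain

# table() returns one count per level; collect the levels below the cutoff.
counts <- table(df$names)
rare <- names(counts)[counts < min_count]

# Assigning the same label to several levels merges them into one level.
df$names_lump <- df$names
levels(df$names_lump)[levels(df$names_lump) %in% rare] <- "Other"

nlevels(df$names_lump)  # number of levels left, including "Other"
```

This is close to the attempt in the question: the level replacement itself is valid, but its result should not also be assigned back to df$name in the same statement.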

Upvotes: 0

koki25ando

Reputation: 74

First, I would count the name frequencies with table() and use top_n() to pick the top 15 (or 10) names in your data set, storing them in top_15_names. Then I would use mutate() to create a name_category column that records whether each name belongs to that group. Here is how I would do it.

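library(dplyr)  # provides %>%, arrange(), top_n(), mutate(), case_when()

df$name = as.factor(df$name)
# Frequency table of names, sorted, keeping the 15 most frequent
# (top_n() can return more than 15 rows when counts are tied)
top_15 = data.frame(table(df$name)) %>%
  arrange(desc(Freq)) %>%
  top_n(15, Freq)
top_15_names = top_15$Var1

# Label each row by whether its name is among the top 15
dat = df %>%
  mutate(name_category = case_when(
    name %in% top_15_names ~ "Top15",
    TRUE ~ "Others"
  ))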
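The "Top15" / "Others" labelling above collapses all frequent names into a single group. If each frequent name should keep its own label, which is what the question seems to want for plotting, a variant along these lines might work. This is only a sketch: the toy data, the `top_names` object, and the choice of 5 instead of 15 are assumptions made to keep the example small.

```r
library(dplyr)

# Toy data (assumed): 26 names with unequal frequencies.
set.seed(1)
df <- data.frame(
  name = sample(letters, 5000, replace = TRUE, prob = 26:1),
  stringsAsFactors = FALSE
)

# Count each name, sort by frequency, and keep the most frequent ones.
top_names <- df %>%
  count(name, sort = TRUE) %>%
  slice_head(n = 5) %>%   # 5 only because the toy data has 26 names
  pull(name)

# Frequent names keep their own label; everything else becomes "Others".
dat <- df %>%
  mutate(name_category = if_else(name %in% top_names,
                                 as.character(name),
                                 "Others"))

count(dat, name_category, sort = TRUE)
```

Unlike top_n(), slice_head() returns exactly n rows, so ties at the cutoff are broken by row order.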

I hope you find this helpful.

Upvotes: 0
