Reputation: 1
I have a large factor (df$name) with more than 1000 levels. What I need is the top 10-15 levels by frequency, with the remaining levels lumped together as 'Others'.
I tried using the following command but wasn't successful: df$name <- levels(df$name)[which(table(df$name)<1000000)] <- "Others"
PS: I'm using a frequency count since I don't want to restrict myself to a specific number of levels here. I'm happy if I get anywhere from 5-20 top levels (by frequency) and the rest of them combined as 'Others' for easy visualization.
Upvotes: 0
Views: 47
Reputation: 66870
Here's a column in a data frame holding a factor with 2000 levels:
df <- data.frame(names = sample(1:2000, 1E6, replace = TRUE))
df$names <- as.factor(df$names)
And here a new variable is added which keeps the top 15 levels and lumps the rest into "Other":
df$names_lump = forcats::fct_lump(df$names, n = 15)
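Since the question asks for a frequency cutoff rather than a fixed number of levels, here is a minimal base-R sketch of that variant. It assumes a hypothetical skewed sample and an arbitrary cutoff `min_count`; neither comes from the original data. It relies on the fact that assigning the same label to several factor levels merges them into one level.

```r
set.seed(1)
# Hypothetical skewed data so that a few levels dominate (not the asker's data)
df <- data.frame(
  names = factor(sample(1:200, 1e4, replace = TRUE, prob = 1 / (1:200)))
)

min_count <- 100                  # assumed cutoff; tune it until 5-20 levels remain
counts <- table(df$names)         # counts come back in the same order as levels(df$names)

df$names_lump <- df$names
# Relabel every level below the cutoff; the duplicate labels collapse into one level
levels(df$names_lump)[as.vector(counts) < min_count] <- "Others"

# Inspect the lumped frequencies
sort(table(df$names_lump), decreasing = TRUE)
```

The attempt in the question most likely fails for two reasons: the chained `<-` ends up assigning the string "Others" to the whole column, and a cutoff of 1000000 is larger than any single level's count.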
Upvotes: 0
Reputation: 74
First of all, I would count the name frequencies with table() and use top_n() to pick the top 15 (or 10) names in your data set. (I stored them in the top_15_names object.) After that, I created a name_category column with mutate() to show which group each name falls into. Here is how I would do it:
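library(dplyr)  # provides %>%, arrange(), top_n(), mutate(), case_when()

df$name = as.factor(df$name)
# Frequency table of names, keeping the 15 most frequent
top_15 = data.frame(table(df$name)) %>%
  arrange(desc(Freq)) %>%
  top_n(15, Freq)
top_15_names = top_15$Var1
# Label each row by whether its name is among the top 15
dat = df %>%
  mutate(name_category = case_when(
    name %in% top_15_names ~ "Top15",
    TRUE ~ "Others"
  ))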
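Note that the code above gives every top name the single label "Top15". If the individual top names should be kept and only the rest lumped, as the question describes, a variation might look like the sketch below. The sample data and the column name `name_lumped` are assumptions for illustration, and `slice_max()` requires dplyr 1.0.0 or later.

```r
library(dplyr)

set.seed(1)
# Hypothetical skewed data standing in for df$name
df <- data.frame(
  name = factor(sample(1:200, 1e4, replace = TRUE, prob = 1 / (1:200)))
)

# Names of the 15 most frequent levels, as character strings
top_names <- df %>%
  count(name) %>%
  slice_max(n, n = 15, with_ties = FALSE) %>%
  pull(name) %>%
  as.character()

# Keep a top name as-is; replace everything else with "Others"
dat <- df %>%
  mutate(name_lumped = if_else(name %in% top_names, as.character(name), "Others"))

# Frequencies of the lumped column
count(dat, name_lumped, sort = TRUE)
```

Setting `with_ties = FALSE` keeps exactly 15 names even when several share the 15th-highest count.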
I hope you find this helpful.
Upvotes: 0