Reputation: 14084
I have a column called Keywords, and it is a factor with 150 levels. Most of these levels are combinations of other levels or typos. I'd like to remove all the rows whose Keywords value is a level that has only 1-5 instances. How do I do that?
For instance:
Let's say I have 300 rows with 'a' as the keyword, a couple hundred 'b's and a few hundred 'c's. But then I have 100 other levels that should be one of those three, but are some variant like 'A 1' or 'A2'. I'm just trying to get an idea of the data, but all the rarely occurring levels are throwing off the graphs.
Upvotes: 2
Views: 1841
Reputation: 226911
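Something like:
# count how many rows each level has
tt <- table(dd$Keywords)
# levels with 5 or fewer rows (the question asks about levels with 1-5 instances)
rare_levels <- names(tt)[tt <= 5]
# keep only the rows whose keyword is not a rare level
dd <- subset(dd, !Keywords %in% rare_levels)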
After subsetting, you can use either dd$Keywords <- factor(dd$Keywords) or dd$Keywords <- droplevels(dd$Keywords) to drop the rare factor levels from the list of levels (subset drops the observations, not the levels themselves).
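As a minimal sketch of the steps above on a made-up data set: the data frame, the value column, and the min_count name are assumptions for illustration only, and the threshold of 5 is taken from the "1-5 instances" wording in the question.

```r
# Toy data (hypothetical): three common keywords plus a few rare variants
dd <- data.frame(
  Keywords = factor(c(rep("a", 10), rep("b", 8), rep("c", 7), "A 1", "A2", "A2")),
  value = seq_len(28)
)

min_count <- 5  # assumed threshold: levels with this many rows or fewer are dropped

# Count rows per level and collect the names of the rare levels
tt <- table(dd$Keywords)
rare_levels <- names(tt)[tt <= min_count]

# Drop the rows that belong to rare levels, then drop the now-empty levels
dd_common <- dd[!dd$Keywords %in% rare_levels, , drop = FALSE]
dd_common$Keywords <- droplevels(dd_common$Keywords)

table(dd_common$Keywords)
```

Without the droplevels() step, table() and most plotting functions would still show the rare levels with a count of zero.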
Upvotes: 5
Reputation: 140
You can use the n() function from the dplyr package:
library(dplyr)
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
group_by(cyl) %>%
filter(n() > 12) # keep only the levels that have more than 12 observations
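Applied to a Keywords column like the one in the question, the same idea might look as follows. This is only a sketch: the toy data frame and the threshold of 5 are assumptions, and the ungroup() and droplevels() steps are additions so that the result is an ordinary ungrouped data frame without empty levels.

```r
library(dplyr)

# Toy data (hypothetical): two common keywords plus a few rare variants
dd <- data.frame(
  Keywords = factor(c(rep("a", 10), rep("b", 8), "A 1", "A2", "A2"))
)

dd_common <- dd %>%
  group_by(Keywords) %>%
  filter(n() > 5) %>%                      # keep levels with more than 5 rows
  ungroup() %>%                            # remove the grouping again
  mutate(Keywords = droplevels(Keywords))  # drop the now-empty levels

levels(dd_common$Keywords)
```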
Upvotes: 4