How to remove rows containing unique factor levels from a data frame in R?

Question

I have a column called Keywords, and it is a factor with 150 levels. Most of these levels are combinations of other levels or typos. I'd like to remove all the rows whose Keywords value is a level that has only 1-5 instances. How do I do that?

For instance:

Let's say I have 300 rows with 'a' as the keyword, a couple hundred 'b's and a few hundred 'c's. But then I have 100 other levels that should be one of those three, but are some variant like 'A 1' or 'A2'. I'm just trying to get an idea of the data, but all the relatively low-occurring levels are throwing off all the graphs.

Ben Bolker · Accepted Answer

Something like

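tt <- table(dd$Keywords)                      # number of rows per level
rare_levels <- names(tt)[tt <= 5]             # levels with 5 or fewer rows (the question's 1-5 cutoff)
dd <- subset(dd, !Keywords %in% rare_levels)  # keep only rows whose level is not rare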

After subsetting, you can use either dd$Keywords <- factor(dd$Keywords) or dd$Keywords <- droplevels(dd$Keywords) to remove the rare levels from the factor's list of levels. subset drops only the observations; the now-empty levels stay attached to the factor until you drop them explicitly.
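
To see the difference between dropping rows and dropping levels, here is a minimal sketch on made-up data. The data frame contents, the value column, and the name dd_common are assumptions for illustration only; the cutoff of 5 or fewer rows follows the question.

```r
# Hypothetical toy data: three common keywords plus a few rare variants.
dd <- data.frame(
  Keywords = factor(c(rep("a", 300), rep("b", 200), rep("c", 250),
                      rep("A 1", 2), rep("A2", 5), "b-typo")),
  value = seq_len(758)
)

# Count rows per level and collect the levels with 5 or fewer rows.
tt <- table(dd$Keywords)
rare_levels <- names(tt)[tt <= 5]

# Drop the rows that carry a rare level.
dd_common <- subset(dd, !Keywords %in% rare_levels)

nrow(dd_common)              # 750 rows remain
nlevels(dd_common$Keywords)  # still 6: the empty levels are kept

# Remove the now-unused levels from the factor.
dd_common$Keywords <- droplevels(dd_common$Keywords)
levels(dd_common$Keywords)   # "a" "b" "c"
```

Skipping the droplevels step would leave the empty levels in place, so they could still show up as zero-count categories in table() output or on plot axes.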
