Travis Heeter

Reputation: 14084

How to remove rows containing unique factor levels from a data frame in R?

I have a column called Keywords. It is a factor with 150 levels, and most of those levels are combinations of other levels or typos. I'd like to remove every row whose Keywords value is a level that has only 1-5 instances. How do I do that?

For instance:

Let's say I have 300 rows with 'a' as the keyword, a couple hundred 'b's and a few hundred 'c's. But then I have 100 other levels that should be one of those three, but are some variant like 'A 1' or 'A2'. I'm just trying to get an idea of the data, but all the relatively low-occurring levels are throwing off the graphs.

Upvotes: 2

Views: 1841

Answers (2)

Ben Bolker

Reputation: 226911

Something like

tt <- table(dd$Keywords)               # number of rows per level
rare_levels <- names(tt)[tt <= 5]      # levels with 5 or fewer instances
dd <- subset(dd, !Keywords %in% rare_levels)

After subsetting, you can use either dd$Keywords <- factor(dd$Keywords) or dd$Keywords <- droplevels(dd$Keywords) to drop the rare levels from the factor's list of levels. subset() removes the observations, but the now-unused levels stay attached to the factor until you drop them.
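For a quick check of the steps above, here is a minimal sketch on made-up data. The toy values and the cutoff of 5 are assumptions chosen to mirror the question, not anything from the real data set, and it uses plain indexing instead of subset() just to stay self-contained.

```r
# Toy data frame: two common keywords plus a few rare variants (illustrative only).
dd <- data.frame(
  Keywords = factor(c(rep("a", 10), rep("b", 8), "A 1", "A2", "A2"))
)

tt <- table(dd$Keywords)               # number of rows per level
rare_levels <- names(tt)[tt <= 5]      # levels with 5 or fewer instances

# Keep only rows whose level is not rare; drop = FALSE keeps a data frame
# even though there is only one column.
dd <- dd[!dd$Keywords %in% rare_levels, , drop = FALSE]

# Remove the now-unused levels from the factor itself.
dd$Keywords <- droplevels(dd$Keywords)

levels(dd$Keywords)
nrow(dd)
```

Without the droplevels() step, levels(dd$Keywords) would still list 'A 1' and 'A2' with zero counts, and those empty levels tend to show up in plots and tables.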

Upvotes: 5

Xinlu

Reputation: 140

You can use n() from the dplyr package, which gives the number of rows in the current group:

library(dplyr)

mtcars %>%
    mutate(cyl = as.factor(cyl)) %>%
    group_by(cyl) %>%
    filter(n() > 12)  # keep only levels with more than 12 observations
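
Applied to the question's setup, the same idea might look like the sketch below. The toy data and the cutoff of 5 are assumptions for illustration; the ungroup() and droplevels() steps are additions so the result is an ordinary ungrouped table with no empty levels.

```r
library(dplyr)

# Toy data frame standing in for the real one (illustrative values only).
dd <- data.frame(
  Keywords = factor(c(rep("a", 10), rep("b", 8), "A 1", "A2", "A2"))
)

dd_common <- dd %>%
    group_by(Keywords) %>%
    filter(n() > 5) %>%                       # keep levels with more than 5 rows
    ungroup() %>%                             # remove the grouping again
    mutate(Keywords = droplevels(Keywords))   # discard the unused levels

levels(dd_common$Keywords)
nrow(dd_common)
```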

Upvotes: 4
