Conditionally remove certain rows in R

Question

I have a very large data frame with fish species captured as one of the columns. Here is a very shortened example:

ID = seq(1,50,1)

fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)

dat = as.data.frame(cbind(ID, common))

I want to remove any species that make up less than a certain percentage of the data. For the example here say I want to remove all species that make up less than 30% of the data:

library(dplyr)
nrow(filter(dat, common == "bass"))      #22 rows -> 22/50 -> 44%
nrow(filter(dat, common == "jack"))      #12 rows -> 12/50 -> 24%
nrow(filter(dat, common == "snapper"))   #16 rows -> 16/50 -> 32%

Here, jacks make up less than 30% of the rows, so I want to remove all the rows with jacks (or all species with less than 15 rows). This is easy to do here, but in reality I have over 700 fish species in my data frame and I want to throw out all species that make up less than 1% of the data (which in my case would be less than 18,003 rows). Is there a streamlined way to do this without having to filter out each species individually?

I imagine perhaps some kind of loop that says if the number of rows for common name = "x" is less than 18003, remove those rows...

Piotr K · Accepted Answer

You may also do it in one pipe:

library(dplyr)

dat %>%
  mutate(percentage = n()) %>%
  group_by(common) %>%
  mutate(percentage = n() / percentage) %>%
  filter(percentage > 0.3) %>%
  select(-percentage)

Conditionally remove certain rows in R

Answers (2)

Related Questions