Dr.FishGirl
Dr.FishGirl

Reputation: 43

Conditionally remove certain rows in R

I have a very large data frame with fish species captured as one of the columns. Here is a very shortened example:

ID = seq(1,50,1)

fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)

dat = as.data.frame(cbind(ID, common))

I want to remove any species that make up less than a certain percentage of the data. For the example here say I want to remove all species that make up less than 30% of the data:

library(dplyr)
nrow(filter(dat, common == "bass"))      #22 rows -> 22/50 -> 44%
nrow(filter(dat, common == "jack"))      #12 rows -> 12/50 -> 24%
nrow(filter(dat, common == "snapper"))   #16 rows -> 16/50 -> 32%

Here, jacks make up less than 30% of the rows, so I want to remove all the rows with jacks (or all species with less than 15 rows). This is easy to do here, but in reality I have over 700 fish species in my data frame and I want to throw out all species that make up less than 1% of the data (which in my case would be less than 18,003 rows). Is there a streamlined way to do this without having to filter out each species individually?

I imagine perhaps some kind of loop that says if the number of rows for common name = "x" is less than 18003, remove those rows...

Upvotes: 0

Views: 165

Answers (2)

Piotr K
Piotr K

Reputation: 1025

You may also do it in one pipe:

library(dplyr)

dat %>%
  mutate(percentage = n()) %>%
  group_by(common) %>%
  mutate(percentage = n() / percentage) %>%
  filter(percentage > 0.3) %>%
  select(-percentage)

Upvotes: 3

Brian Fisher
Brian Fisher

Reputation: 1367

One way to approach this is to first create a summary table, then filter based on the summary stat. There are probably more direct ways to accomplish the same thing.

library(dplyr)
set.seed(914) # so you get the same results from sample()

ID = seq(1,50,1)

fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)

dat = as.data.frame(cbind(ID, common))  # same as your structure, but I ended up with different species mix

summ.table <-  dat %>%
      group_by(common) %>%
      summarize(number = n()) %>%
      mutate(pct= number/sum(number))
summ.table
# # A tibble: 3 x 3
# common  number   pct
# <fct>    <int> <dbl>
# 1 bass        18  0.36
# 2 jack        18  0.36
# 3 snapper     14  0.28

include <- summ.table$common[summ.table$pct > .3]

dat.selected = filter(dat, common %in% include)

Upvotes: 2

Related Questions