Reputation: 43
I have a very large data frame with fish species captured as one of the columns. Here is a very shortened example:
ID = seq(1,50,1)
fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)
dat = as.data.frame(cbind(ID, common))
I want to remove any species that make up less than a certain percentage of the data. For the example here say I want to remove all species that make up less than 30% of the data:
library(dplyr)
nrow(filter(dat, common == "bass")) #22 rows -> 22/50 -> 44%
nrow(filter(dat, common == "jack")) #12 rows -> 12/50 -> 24%
nrow(filter(dat, common == "snapper")) #16 rows -> 16/50 -> 32%
Here, jacks make up less than 30% of the rows, so I want to remove all the rows with jacks (or all species with less than 15 rows). This is easy to do here, but in reality I have over 700 fish species in my data frame and I want to throw out all species that make up less than 1% of the data (which in my case would be less than 18,003 rows). Is there a streamlined way to do this without having to filter out each species individually?
I imagine perhaps some kind of loop that says if the number of rows for common name = "x" is less than 18003, remove those rows...
Upvotes: 0
Views: 165
Reputation: 1025
You may also do it in one pipe:
library(dplyr)
dat %>%
mutate(percentage = n()) %>%
group_by(common) %>%
mutate(percentage = n() / percentage) %>%
filter(percentage > 0.3) %>%
select(-percentage)
Upvotes: 3
Reputation: 1367
One way to approach this is to first create a summary table, then filter based on the summary stat. There are probably more direct ways to accomplish the same thing.
library(dplyr)
set.seed(914) # so you get the same results from sample()
ID = seq(1,50,1)
fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)
dat = as.data.frame(cbind(ID, common)) # same as your structure, but I ended up with different species mix
summ.table <- dat %>%
group_by(common) %>%
summarize(number = n()) %>%
mutate(pct= number/sum(number))
summ.table
# # A tibble: 3 x 3
# common number pct
# <fct> <int> <dbl>
# 1 bass 18 0.36
# 2 jack 18 0.36
# 3 snapper 14 0.28
include <- summ.table$common[summ.table$pct > .3]
dat.selected = filter(dat, common %in% include)
Upvotes: 2