twieg

Reputation: 73

Removing rows of dataframe based on frequency of a variable

I'm working with a data frame in R that contains observations of animals in the wild (time/date, location, and species identification). I want to remove every row for a species if there are fewer than x observations of it in the whole data frame. I got this to work with the following code, but I know there must be a more elegant and efficient way to do it.

namelist <- names(table(ind.data$Species))
for (i in 1:length(namelist)) {
  if (table(ind.data$Species)[namelist[i]] <= 2) {
    while (namelist[i] %in% ind.data$Species) {
      j <- match(namelist[i], ind.data$Species)
      ind.data <- ind.data[-j,]
    }
  }
}

The namelist vector contains all the species names in the data frame ind.data, and the if statement checks whether the frequency of the ith name on the list is at or below the cutoff (2 in this example). If it is, the while loop deletes that species' rows one at a time until none are left.

I'm fully aware that this is not a very clean way to do it; I just threw it together at the end of the day yesterday to see if it would work. Now I'm looking for a better way to do it, or at least for ways to refine it.

Upvotes: 1

Views: 1767

Answers (2)

akrun

Reputation: 887881

We can use data.table:

library(data.table)
setDT(ind.data)[, .SD[.N > 2], by = Species]

Upvotes: 0

David Robinson

Reputation: 78630

You can do this with the dplyr package:

library(dplyr)

new.ind.data <- ind.data %>%
  group_by(Species) %>%
  filter(n() > 2) %>%
  ungroup()

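An alternative that uses only built-in functions is ave():

# Pass a numeric first argument so the group sizes come back as numbers;
# passing Species itself would return text (or NAs for a factor).
group_sizes <- ave(seq_len(nrow(ind.data)), ind.data$Species, FUN = length)
new.ind.data <- ind.data[group_sizes > 2, ]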
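If it helps to see the idea end to end, here is a minimal base R sketch built on table() and %in%, which is closer to what the loop in the question does. The toy data and the helper name drop_rare are made up for illustration; only the Species column name comes from the question.

```r
# Toy data standing in for ind.data (species names and sites are invented).
ind.data <- data.frame(
  Species = c("fox", "fox", "fox", "owl", "owl", "deer", "deer", "deer", "deer"),
  Site    = c("A", "B", "A", "C", "C", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose value in `column` occurs more than
# min_n times in the whole data frame.
drop_rare <- function(df, column, min_n) {
  counts <- table(df[[column]])             # observations per species
  keep   <- names(counts)[counts > min_n]   # species above the cutoff
  df[df[[column]] %in% keep, , drop = FALSE]
}

new.ind.data <- drop_rare(ind.data, "Species", 2)
table(new.ind.data$Species)
```

With this toy data, "owl" has only two rows and is dropped, while "fox" and "deer" are kept. The counts are computed once rather than on every loop iteration, which is the main inefficiency in the original code.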

Upvotes: 1
