vorpal
vorpal

Reputation: 318

dplyr filter columns with value 0 for all rows with unique combinations of other columns

I have a dataframe that looks like this:

df <- tibble(date = c(2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01), 
             site = c("X", "X", "X", "X", "Z", "Z", "Z", "Z"), 
             treatment = c("a", "a", "b", "b", "a", "a", "b", "b"),
             species = c("vetch", "clover", "vetch", "clover", "vetch", "clover", "vetch", "clover"),
             frequency = c(0, 1, 1, 1 1, 0, 1, 0))

But with lots of dates and sites and treatments. What I want is to filter out observations where all frequencies of that species (across all treatments and dates) is 0 for that site. So in the above I want to remove clover at site "Z" because it did not occur at any treatment or date at that site, but I want to leave clover in site "X" because it did occur in one of the treatments. So I want:

tibble(date = c(2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01),
       site = c("X", "X", "X" "X", "Z", "Z"),
       treatment = c("a", "a", "b", "b", "a", "b"),
       species = c("vetch", "clover", "vetch", "clover", "vetch", "vetch")
       frequency = c(0, 1, 1, 1, 1, 1))

My first thought was to pivot_wider, select columns then pivot_longer again, but this didn't work because the clover column was still selected by having a 1 in site "X":

  df %>%
    pivot_wider(names_from = species, names_prefix = "spp.", values_from = frequency, values_fill = 0) %>%
    group_by(site) %>%
    select_if(~ !is.numeric(.) || sum(.) != 0) %>%
    pivot_longer(starts_with("spp."), names_to = "species", names_prefix = "spp.", values_to = "frequency") -> df

So I guess I need to filter instead, but I can't figure out how to do that.

Upvotes: 0

Views: 1919

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388817

Maybe not for this dataset but generally using sum might not be the right approach since if you have negative numbers it might cancel it out and you'll get wrong groups removed. You can use all or any :

With dplyr :

library(dplyr)
df %>% group_by(date, site, species) %>% filter(any(frequency != 0))
#Also
#df %>% group_by(date, site, species) %>% filter(!all(frequency == 0))

#  date site  treatment species frequency
#  <dbl> <chr> <chr>     <chr>       <dbl>
#1  2018 X     a         vetch           0
#2  2018 X     a         clover          1
#3  2018 X     b         vetch           1
#4  2018 X     b         clover          1
#5  2018 Z     a         vetch           1
#6  2018 Z     b         vetch           1

The same can be done in data.table as well :

library(data.table)
setDT(df)[, .SD[any(frequency != 0)], .(date, site, species)]

Or in base R :

subset(df, ave(frequency != 0, date, site, species, FUN = any))

Upvotes: 4

An easy solution can be achieved by creating another column that contains the frequency of each species grouped by date, site and species (ignoring treatment). Then you can easily filter using this new column and afterwards eliminate it.

library(tidyverse)
df %>%
    # Group by date site and species
    group_by(date, site, species) %>%
    # Create new column that sums frequency values by grouping variables
    mutate(appears = sum(frequency)) %>%
    # ignore rows where appears = 0
    filter(appears != 0) %>%
    # Eliminate appears column
    select(-appears)

Upvotes: 3

Related Questions