Brian Leo

Reputation: 11

Use dplyr to calculate percentage and frequency of occurrence of two groups

I'm learning dplyr and have searched similar posts for solutions, but found none that covers this combination of problems.

Here is an example data frame:

set.seed(1)
df <- data.frame(sampleID = c(rep("sample1",2),
                              rep("sample2",3),
                              rep("sample3",4)),
                 species = c("clover","nettle",
                             "clover","nettle","vine",
                             "clover","clover","nettle","vine"),
                 type = c("vegetation","seed",
                          "vegetation","vegetation","vegetation",
                          "seed","vegetation","seed","vegetation"),
                 mass = sample(1:9))

> df
  sampleID species       type mass
1  sample1  clover vegetation    9
2  sample1  nettle       seed    4
3  sample2  clover vegetation    7
4  sample2  nettle vegetation    1
5  sample2    vine vegetation    2
6  sample3  clover       seed    6
7  sample3  clover vegetation    3
8  sample3  nettle       seed    8
9  sample3    vine vegetation    5

I need to return a data frame that gives the percent mass for each unique species/type combination, as well as the percent frequency with which each species/type combination occurs across sampleIDs.

So for the species/type combination vine/vegetation in this example, the percent mass would be (5+2)/sum(mass), and the percent frequency would be 2/3, since that combination did not occur in sample1.

To start I tried different combinations such as:

df %>%
  group_by(species,type) %>%
  summarize(totmass = sum(mass))  %>%
  mutate(percmass = totmass/sum(totmass))

but that gives 100% mass for vine/vegetation. I also don't know where to go from there to get the percent frequencies based on sampleID.

Upvotes: 0

Views: 397

Answers (1)

stefan

Reputation: 124048

Not sure whether I got you right, but maybe this is what you are looking for:

set.seed(1)
df <- data.frame(sampleID = c(rep("sample1",2),
                              rep("sample2",3),
                              rep("sample3",4)),
                 species = c("clover","nettle",
                             "clover","nettle","vine",
                             "clover","clover","nettle","vine"),
                 type = c("vegetation","seed",
                          "vegetation","vegetation","vegetation",
                          "seed","vegetation","seed","vegetation"),
                 mass = sample(1:9))

library(dplyr)

df %>%
  # Add total mass
  add_count(wt = mass, name = "sum_mass") %>%
  # Add total number of samples
  mutate(nsamples = n_distinct(sampleID)) %>%
  # Add sum_mass and nsamples to group_by
  group_by(species, type, sum_mass, nsamples) %>%
  summarize(nsample = n_distinct(sampleID), 
            totmass = sum(mass), .groups = "drop")  %>%
  mutate(percmass = totmass / sum_mass,
         percfreq = nsample / nsamples)
#> # A tibble: 5 x 8
#>   species type       sum_mass nsamples nsample totmass percmass percfreq
#>   <chr>   <chr>         <int>    <int>   <int>   <int>    <dbl>    <dbl>
#> 1 clover  seed             45        3       1       6   0.133     0.333
#> 2 clover  vegetation       45        3       3      19   0.422     1    
#> 3 nettle  seed             45        3       2      12   0.267     0.667
#> 4 nettle  vegetation       45        3       1       1   0.0222    0.333
#> 5 vine    vegetation       45        3       2       7   0.156     0.667
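
As for why the attempt in the question reported 100% for vine/vegetation: by default, summarize() drops only the last grouping variable, so the result is still grouped by species, and sum(totmass) is then taken within each species. Vine has only one type, so its share within its own species is 100%. A more compact sketch of the same calculation, assuming dplyr >= 1.0.0 for the .groups argument, drops the grouping before computing the shares. The mass values are typed out from the printed table in the question rather than drawn with sample(), and `res` is just an illustrative name:

```r
library(dplyr)

# Same data as in the question; mass values copied from the printed table
df <- data.frame(
  sampleID = c(rep("sample1", 2), rep("sample2", 3), rep("sample3", 4)),
  species  = c("clover", "nettle", "clover", "nettle", "vine",
               "clover", "clover", "nettle", "vine"),
  type     = c("vegetation", "seed", "vegetation", "vegetation", "vegetation",
               "seed", "vegetation", "seed", "vegetation"),
  mass     = c(9, 4, 7, 1, 2, 6, 3, 8, 5)
)

res <- df %>%
  group_by(species, type) %>%
  # One row per species/type: total mass and number of samples it occurs in
  summarize(totmass = sum(mass),
            nsample = n_distinct(sampleID),
            .groups = "drop") %>%   # drop all grouping so sums span every row
  mutate(percmass = totmass / sum(totmass),
         # Denominator is the number of distinct samples in the full data
         percfreq = nsample / n_distinct(df$sampleID))

res
```

With the grouping dropped, percmass sums to 1 across all five species/type combinations instead of summing to 1 within each species.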

Upvotes: 1
