JMac
JMac

Reputation: 1

Calculating descriptive stats on categorical grouped variables in large dataset (R)

I am working with a large (~500,000) dataset on the sex, parentage, and species of a group of animals. I need to calculate the median sex ratio per species. Each row in my dataset is an individual animal’s ID (AnonID), the species name, the sex of the individual, and the mother's ID (DamAnonID). Here is an example of the dataset:

df <- data.frame(
  AnonID = c(1,2,3,4,5,6,7,8,9,10),
  Species = c("A","A", "A","B","C","C","A","C","B","C"),
  Sex = c("F","F","M","F","M","F","F","F","M","M"),
  DamAnonID = c(0,1,1,2,2,2,4,4,4,6)
)

Note: DamAnonID is the AnonID of the mother and most mothers are in the dataset itself.

I would like to end with dataset like this:

Species <- c("A", "B", "C")
MedianSexRatio <- c(.33,.33,1)

I am still a beginner at R and so I have been thinking about it in the following steps:

  1. Count number of male and female offspring per dam
  2. Calculate sex ratio per dam (M/M+F)
  3. Calculate median sex ratio per species

I am unfortunately stuck at the very beginning. I cannot figure out how to count numbers of male and female offspring per dam. I have been using group_by in dpylr to group by DamAnonID, but can't figure out how to count categorical variables in summarise and am not sure of other coding options. I have also tried

data %>% 
group_by (DamAnonID, Sex) %>% 
tally()

This does the correct calculation, but I end up with two rows per DamAnonID – one for male offspring and one for female offspring. I need one row per DamAnonID with one column for male offspring and one column for female offspring.

I would appreciate any help into how I could calculate sex ratios per dam or, if there’s a way to just focus on calculating median sex ratio per species, I would greatly appreciate that too! Thanks in advance!

Upvotes: 0

Views: 53

Answers (2)

Jon Spring
Jon Spring

Reputation: 66480

With dplyr 1.1.0 we have the .by argument:

df %>%
  summarize(sex_ratio = sum(Sex == "M") / n(), .by = c(Species, DamAnonID)) %>%
  summarize(median_ratio = median(sex_ratio), .by = Species)

Or alternatively (which works for prior dplyr too):

df %>%
  group_by(Species, DamAnonID) %>%
  summarise(sex_ratio = sum(Sex == "M") / n()) %>% # grouped by Species and Dam
  summarize(median_ratio = median(sex_ratio))      # now just grouped by Species

The output of the first two lines is:

  Species DamAnonID sex_ratio
1       A         0       0.0
2       A         1       0.5
3       B         2       0.0
4       C         2       0.5
5       A         4       0.0
6       C         4       0.0
7       B         4       1.0
8       C         6       1.0

...which fed into the 3rd line gives:

  Species median_ratio
1       A          0.0
2       B          0.5
3       C          0.5

Upvotes: 0

DaveArmstrong
DaveArmstrong

Reputation: 21937

You can just pivot the data after it is calculated:

library(dplyr)
library(tidyr)
df <- data.frame(
  AnonID = c(1,2,3,4,5,6,7,8,9,10),
  Species = c("A","A", "A","B","C","C","A","C","B","C"),
  Sex = c("F","F","M","F","M","F","F","F","M","M"),
  DamAnonID = c(0,1,1,2,2,2,4,4,4,6)
)
df %>% 
  group_by (DamAnonID, Sex) %>% 
  tally() %>% 
  pivot_wider(names_from = "Sex", values_from="n", values_fill = 0)
#> # A tibble: 5 × 3
#> # Groups:   DamAnonID [5]
#>   DamAnonID     F     M
#>       <dbl> <int> <int>
#> 1         0     1     0
#> 2         1     1     1
#> 3         2     2     1
#> 4         4     2     1
#> 5         6     0     1

Alternatively, you could just sum the observations of Sex that equal M and F, respectively grouping only by DamAnonID:

df %>% 
  group_by (DamAnonID) %>% 
  summarise(F = sum(Sex == "F"), 
            M = sum(Sex == "M"))
#> # A tibble: 5 × 3
#>   DamAnonID     F     M
#>       <dbl> <int> <int>
#> 1         0     1     0
#> 2         1     1     1
#> 3         2     2     1
#> 4         4     2     1
#> 5         6     0     1

Created on 2023-03-02 with reprex v2.0.2

Upvotes: 0

Related Questions