YBM

Reputation: 47

Calculate percentage while grouping subgroups

I am looking for an R solution. I have the following data frame (this is a sample):

df <- data.frame(groupID = c("Jon", "Jon", "Jon","Jon", "Jon", "Maria", "Maria", "Ben", "Ben", "Tina", "Tina"),
                 breeding_attempt = c(1, 1, 1, 2, 2, 1, 1 , 1, 1, 1, 1),
                 year = c(1999, 1999, 1999, 1999, 1999, 2000, 2000, 2000, 2000, 2001, 2001),
                 femaleID = c("Jony", "Jona", "sami", "Jon", "Jona", "aa", "BB", "Tana", "tt", "gg", "HH"),
                 chicks = c(3, 0, 0, 0, 0, 2, 1, 3, 4, 1, 0))

I need to do two things, both treating each breeding_attempt per year per groupID as the unit of calculation.

(a) How do I remove from the data all breeding_attempts, in the same year AND for the same groupID, in which all the participating females had 0 chicks (e.g. breeding_attempt 2, year 1999, group "Jon" needs to be removed)? Please note that the grouping needs to have 3 levels: groupID -> year -> breeding_attempt.

(b) After subsetting the data as in (a), how do I calculate the percentage of breeding_attempts, per year AND per group, in which only 1 female had >0 chicks and all other participating females had 0 chicks (i.e. the percentage of breeding_attempts with a single successful female out of all breeding_attempts)? In this sample, it should be 50%, as groups "Jon" 1999 1 and "Tina" 2001 1 had only one successful female.

Ideally, I will also be able to get a data frame that summarises the raw data. Namely, a dataframe in which each line represents a breeding_attempt per year per group ID and a column indicating whether there was only 1 successful female or not.

I tried working with the aggregate function, but I am new to R and did not get far with it...

Thanks!

Upvotes: 0

Views: 56

Answers (1)

jkd

Reputation: 1664

Since you seem to be looking for a base R solution, here is mine:

# Question a
agg_a <- aggregate(chicks~groupID+year+breeding_attempt, data=df, sum)
# Attempts whose chicks sum to zero; match rows on the full
# groupID/year/breeding_attempt combination, not on each column separately
zero <- agg_a[agg_a$chicks==0, ]
df2 <- subset(df, !(paste(groupID, year, breeding_attempt) %in%
                      paste(zero$groupID, zero$year, zero$breeding_attempt)))

# Question b
agg_b <- aggregate(chicks>0~groupID+year+breeding_attempt, data=df2, sum)
agg_b$just1 <- agg_b$`chicks > 0`==1
sum(agg_b$just1)/nrow(agg_b)

I think the agg_b data.frame provides the summarization you were also looking for.
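
If it helps, here is a minimal sketch of that summary built in one pass, with plainer column names. The names successful, n_successful, attempt_summary and single_success are my own choices for illustration, not anything from your data, and the sketch assumes a "successful" female is simply one with chicks > 0.

```r
# Sample data from the question
df <- data.frame(groupID = c("Jon", "Jon", "Jon", "Jon", "Jon", "Maria", "Maria", "Ben", "Ben", "Tina", "Tina"),
                 breeding_attempt = c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1),
                 year = c(1999, 1999, 1999, 1999, 1999, 2000, 2000, 2000, 2000, 2001, 2001),
                 femaleID = c("Jony", "Jona", "sami", "Jon", "Jona", "aa", "BB", "Tana", "tt", "gg", "HH"),
                 chicks = c(3, 0, 0, 0, 0, 2, 1, 3, 4, 1, 0))

# Flag each female that produced at least one chick
df$successful <- df$chicks > 0

# One row per groupID/year/breeding_attempt, counting successful females
attempt_summary <- aggregate(successful ~ groupID + year + breeding_attempt,
                             data = df, FUN = sum)
names(attempt_summary)[names(attempt_summary) == "successful"] <- "n_successful"

# Part (a): drop attempts in which no female had chicks
attempt_summary <- attempt_summary[attempt_summary$n_successful > 0, ]

# Part (b): flag attempts with exactly one successful female
attempt_summary$single_success <- attempt_summary$n_successful == 1

# Share of attempts with a single successful female
mean(attempt_summary$single_success)
```

Working at the attempt level like this means the filtering in (a) and the flag in (b) both come from the same count, so there is only one grouping step to get right.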

Since you are new to R and tried to use aggregate, you may not know that there is a collection of R packages called the tidyverse, which has its own syntax and is often contrasted with base R. For beginners it can be difficult to learn the base R and the tidyverse way of doing things at the same time, which is why you may want to stick with base R for now.

Nevertheless, here is a possible tidyverse solution:

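library(dplyr)  # provides group_by(), filter() and summarise()
# Note: the native pipe |> requires R >= 4.1

# Question a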
df2 <- df |>
  group_by(groupID, year, breeding_attempt) |>
  filter(sum(chicks)>0)

# Question b
agg_b <- df2 |>
  group_by(groupID, year, breeding_attempt) |>
  summarise(just1=sum(chicks>0)==1, .groups="drop")
sum(agg_b$just1)/nrow(agg_b)
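
The share above is computed over all attempts together. If "per year AND per group" is meant literally, i.e. one percentage for each groupID and year, a base R sketch under that reading could look like this. The names total, successful, pct and pct_single_success are hypothetical, and I am assuming the percentage should be taken over the attempts that remain after step (a).

```r
# Sample data from the question
df <- data.frame(groupID = c("Jon", "Jon", "Jon", "Jon", "Jon", "Maria", "Maria", "Ben", "Ben", "Tina", "Tina"),
                 breeding_attempt = c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1),
                 year = c(1999, 1999, 1999, 1999, 1999, 2000, 2000, 2000, 2000, 2001, 2001),
                 femaleID = c("Jony", "Jona", "sami", "Jon", "Jona", "aa", "BB", "Tana", "tt", "gg", "HH"),
                 chicks = c(3, 0, 0, 0, 0, 2, 1, 3, 4, 1, 0))

# Part (a): ave() returns the per-attempt chick total aligned with the rows of df
total <- ave(df$chicks, df$groupID, df$year, df$breeding_attempt, FUN = sum)
df2 <- df[total > 0, ]

# One row per attempt, flagging attempts with exactly one successful female
df2$successful <- df2$chicks > 0
agg_b <- aggregate(successful ~ groupID + year + breeding_attempt,
                   data = df2, FUN = sum)
agg_b$just1 <- agg_b$successful == 1

# Percentage of single-success attempts within each groupID and year
pct <- aggregate(just1 ~ groupID + year, data = agg_b,
                 FUN = function(x) 100 * mean(x))
names(pct)[names(pct) == "just1"] <- "pct_single_success"
pct
```

With the sample data each group and year has only one remaining attempt, so every value is either 0 or 100; the per-group figures only become informative once a group has several attempts in a year.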

Upvotes: 1
