Jennifer Diamond

Reputation: 113

How to find the mean of certain values in a large dataframe using loop

In R, I have a data frame that looks like this:

         Female.ID    Mate.ID  relatedness
    1           A1         C1       0.0000
    2           A1         D1       0.0000 
    3           A1         E1       0.5062
    4           A1         F1           NA
    5           B1         G1       0.0425
    6           B1         H1       0.0000
    7           B1         I1       0.0349
    8           B1         J1       0.0000
    9           B1         K1       0.0000
    10          B1         L1       0.0887
    11          B1         M1       0.1106
    12          B1         N1       0.0000

I want to create a new data frame that holds, for each Female.ID (A1, B1, etc.), the mean relatedness of all of that female's mates.

I want something like this:

    Female.ID    mean.relatedness
           A1              0.1687
           B1              0.0346

This dataframe is much bigger than this example one, which is why I'm not just subsetting for the female one by one and finding the mean relatedness. I was thinking of doing some kind of for loop, but I'm not sure how to start it off.

Upvotes: 0

Views: 65

Answers (2)

sana elwaar

Reputation: 1

The idea is:

  • to do a group by "Female.ID"
  • then summarize using the mean while ignoring the NA.

If the data is very large, you may want a faster package such as data.table, which is fast and has a concise syntax. For more details, see "data.table vs dplyr: can one do something well the other can't or does poorly?".

In general, explicit loops in R tend to be slower than vectorized or grouped operations. Keep a loop as a last resort, for cases where no package function can express the operation.
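For comparison, here is a minimal base R sketch that gets the per-female means without a loop and without any extra package. The data frame is retyped from the question; the intermediate name `m` and the result name `result` are just illustrative choices, and `tapply()` is one of several base functions that could be used here.

```r
# Example data retyped from the question
df <- data.frame(
  Female.ID   = c(rep("A1", 4), rep("B1", 8)),
  Mate.ID     = c("C1", "D1", "E1", "F1", "G1", "H1",
                  "I1", "J1", "K1", "L1", "M1", "N1"),
  relatedness = c(0, 0, 0.5062, NA, 0.0425, 0, 0.0349,
                  0, 0, 0.0887, 0.1106, 0)
)

# Apply mean() to relatedness within each Female.ID group;
# na.rm = TRUE is passed on to mean() so the NA for mate F1 is skipped
m <- tapply(df$relatedness, df$Female.ID, mean, na.rm = TRUE)

# Turn the named vector into a two-column data frame
result <- data.frame(Female.ID = names(m),
                     mean.relatedness = as.vector(m))
```

This should be fine for moderately sized data; for very large tables the grouped operations in data.table or dplyr are usually the more comfortable choice.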

Here is the syntax using data.table (df being the initial data.frame):

    library(data.table)

    dt <- as.data.table(df)
    dt1 <- dt[, .(mean.relatedness = mean(relatedness, na.rm = TRUE)),
              by = "Female.ID"]
    dt1
    #    Female.ID mean.relatedness
    # 1:        A1        0.1687333
    # 2:        B1        0.0345875

Note that you can group by a vector of several variables, that the summarizing function can be something other than the mean, and that na.rm = TRUE is needed to ignore the NA values while summarizing.
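As a rough sketch of grouping by more than one variable with a different summary function: the `Season` column and all the values below are made up for illustration (the question's data has no such column), and the output names `median.relatedness` and `n.mates` are arbitrary.

```r
library(data.table)

# Toy data; Season is a hypothetical second grouping column
dt <- data.table(
  Female.ID   = c("A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"),
  Season      = c("s1", "s1", "s2", "s2", "s1", "s1", "s2", "s2"),
  relatedness = c(0.0, 0.5, NA, 0.2, 0.1, 0.3, 0.0, 0.4)
)

# Group by both columns; summarize with the median instead of the mean
# and count the rows in each group with .N
res <- dt[, .(median.relatedness = median(relatedness, na.rm = TRUE),
              n.mates = .N),
          by = c("Female.ID", "Season")]
```

Note that `.N` counts every row in the group, including rows where relatedness is NA.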

Upvotes: 0

crazybilly

Reputation: 3092

You could use dplyr:

    library(dplyr)

    # Group rows by female, then average relatedness within each group,
    # skipping NA values
    themeans <- df %>%
        group_by(Female.ID) %>%
        summarize(mean.relatedness = mean(relatedness, na.rm = TRUE))

Upvotes: 4
