calculating mean in differences in grouped variables

Question

I have a dataset with 5 levels of a treatment for a response variable.

Assume I measured soil N content at 5 levels of soil water content (optimal, 40%, 30%, 20%, and 10%), with 5 replicates for each level.

Now I would like to calculate, for each replicate, the unstandardized differences (optimal - 40%, optimal - 30%, optimal - 20%, optimal - 10%) and the standardized differences ((optimal - 40%) / optimal, (optimal - 30%) / optimal, and so on).

Is there a way to do this in R with the tidyverse? I am having trouble writing loops for it. The data, with five replicates for each treatment level:

df <- data.frame(
  Soilwater = c("optimal", "optimal", "optimal", "optimal", "optimal",
                "40", "40", "40", "40", "40",
                "30", "30", "30", "30", "30",
                "20", "20", "20", "20", "20",
                "10", "10", "10", "10", "10",
                "optimal", "optimal", "optimal", "optimal", "optimal",
                "40", "40", "40", "40", "40",
                "30", "30", "30", "30", "30",
                "20", "20", "20", "20", "20",
                "10", "10", "10", "10", "10"),
  Diversity = c("High", "High", "High", "High", "High",
                "High", "High", "High", "High", "High",
                "High", "High", "High", "High", "High",
                "High", "High", "High", "High", "High",
                "High", "High", "High", "High", "High",
                "Low", "Low", "Low", "Low", "Low",
                "Low", "Low", "Low", "Low", "Low",
                "Low", "Low", "Low", "Low", "Low",
                "Low", "Low", "Low", "Low", "Low",
                "Low", "Low", "Low", "Low", "Low"),
  Soil_N = c(50, 45, 49, 48, 49,
             69, 68, 69, 70, 67,
             79, 78, 79, 78, 77,
             89, 89, 87, 88, 89,
             99, 98, 97, 98, 98,
             120, 121, 121, 120, 122,
             134, 131, 132, 134, 131,
             145, 148, 149, 147, 148,
             159, 159, 157, 156, 157,
             169, 167, 167, 168, 164))

I used the code below, suggested by @JonSpring, which was really helpful.

df %>%
    # First, we can add a `Replicate` number based on position within 
    # each Soilwater/Diversity cohort.
    group_by(Soilwater, Diversity) %>%
    mutate(Replicate = row_number()) %>%

    # Calc diff vs. experiment with same Diversity & Replicate, optimal Soilwater 
    group_by(Diversity, Replicate) %>%
    mutate(Difference = Soil_N - Soil_N[Soilwater == "optimal"]) %>%

    # Summarize avg diffs
    group_by(Soilwater, Diversity) %>%
    summarize(Mean_Diff = mean(Difference), .groups = "drop")

However, I realized that I first need to average the optimal Soilwater level and then calculate the difference between that average and each replicate of the other Soilwater levels. I tried the code below (using mean() to average the optimal Soilwater values before taking the difference), but it is not working.

df %>%
    group_by(Soilwater, Diversity) %>%
    mutate(Replicate = row_number()) %>%
    group_by(Diversity, Replicate) %>%
    mutate(Difference = mean(Soil_N[Soilwater == "optimal"]) - Soil_N)

Ray · Accepted Answer

Your problem is a little hard to follow, so I will start from the part that seems to work for you.

As a newcomer to R and the tidyverse, keep in mind that the %>% (pipe) chains your operations on the starting object.
You can assign the result of any stage of the chain to a new object (aka variable).

I further recommend that you create several "interim" objects as you work through a problem, one for each step of your algorithm. This gives you a better feel for what you have at each stage. With experience you will be able to chain the operations and skip some of the interim objects.

For that purpose, I introduce an interim object that covers the part of your pipeline that, by your description, worked for you, i.e. I assign interim_df <- ...

library(dplyr)

interim_df <- df %>%
    group_by(Soilwater, Diversity)  %>%
    mutate(Replicate = row_number()) %>%
    group_by(Diversity, Replicate)

This yields an object interim_df. Let's have a look at it:

interim_df
# A tibble: 50 x 4
# Groups:   Diversity, Replicate [10]
   Soilwater Diversity Soil_N Replicate
   <chr>     <chr>      <dbl>     <int>
 1 optimal   High          50         1
 2 optimal   High          45         2
 3 optimal   High          49         3
 4 optimal   High          48         4
 5 optimal   High          49         5
 6 40        High          69         1
 7 40        High          68         2
 8 40        High          69         3
 9 40        High          70         4
10 40        High          67         5

OK. We have a tibble with 50 rows and 4 variables, which seems to be the data structure you are happy with. Note that it is also a grouped data frame: any mutate() or summarise() you apply now is computed within each Diversity/Replicate group. Be sure to ungroup() when you want to operate on the whole data frame (or on a different grouping).

interim_df <- interim_df %>% ungroup()
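
To illustrate why the grouping matters here, the following is a minimal sketch that computes the "optimal" mean once while the data is still grouped by Diversity and Replicate, and once after ungroup(). It assumes the df from the question, rebuilt compactly with rep() for brevity; the column names GroupedMean and OverallMean are illustrative and not part of the original code.

```r
library(dplyr)

# Same data as in the question, rebuilt compactly (an assumption for brevity)
df <- data.frame(
  Soilwater = rep(rep(c("optimal", "40", "30", "20", "10"), each = 5), times = 2),
  Diversity = rep(c("High", "Low"), each = 25),
  Soil_N = c(50, 45, 49, 48, 49, 69, 68, 69, 70, 67, 79, 78, 79, 78, 77,
             89, 89, 87, 88, 89, 99, 98, 97, 98, 98,
             120, 121, 121, 120, 122, 134, 131, 132, 134, 131,
             145, 148, 149, 147, 148, 159, 159, 157, 156, 157,
             169, 167, 167, 168, 164)
)

check <- df %>%
  group_by(Soilwater, Diversity) %>%
  mutate(Replicate = row_number()) %>%
  group_by(Diversity, Replicate) %>%
  # Each Diversity/Replicate group holds exactly one "optimal" row,
  # so this "mean" is just that single value
  mutate(GroupedMean = mean(Soil_N[Soilwater == "optimal"])) %>%
  ungroup() %>%
  # After ungroup() the mean is taken over all ten "optimal" rows
  mutate(OverallMean = mean(Soil_N[Soilwater == "optimal"]))

# One row per Diversity/Replicate combination, showing both versions
check %>% distinct(Diversity, Replicate, GroupedMean, OverallMean)
```

While the data is grouped by Diversity and Replicate, mean() only ever sees one "optimal" value per group, which is likely why the attempt in the question returned the same numbers as before (with the sign flipped).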

You can extract your "optimal" measurements and calculate the average over this new, smaller tibble.

mean_optimal <- interim_df %>%
    filter(Soilwater == "optimal") %>%
    summarise(MeanOptimal = mean(Soil_N))   # summarise the mean over the rows we filtered

This gives you

# A tibble: 1 x 1
  MeanOptimal
        <dbl>
1        84.5

To be clear, we now have another tibble, with 1 row and 1 column. Its value can be used in interim_df, but you need to extract the column from the tibble first, i.e. turn it into a plain vector. The base R notation $ gives you direct access to a column as a vector; the tidyverse offers the pull() function for the same purpose.
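
As a small sketch of the two extraction styles, the snippet below builds a stand-in for mean_optimal (a one-row tibble holding the 84.5 shown above, hard-coded here as an assumption so the example runs on its own) and pulls the value out both ways. The names via_dollar and via_pull are illustrative.

```r
library(dplyr)

# Stand-in for the mean_optimal tibble (value hard-coded for illustration)
mean_optimal <- tibble(MeanOptimal = 84.5)

# Base R: $ returns the column as a plain numeric vector
via_dollar <- mean_optimal$MeanOptimal

# dplyr: pull() does the same and fits at the end of a pipe
via_pull <- mean_optimal %>% pull(MeanOptimal)

# Both are length-one numeric vectors holding the same value
identical(via_dollar, via_pull)
```

Either form can be used inside mutate(); pull() is mostly convenient when the value comes at the end of a longer pipeline.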

final <- interim_df %>% mutate(Difference = mean_optimal$MeanOptimal - Soil_N)
final
# A tibble: 50 x 5
   Soilwater Diversity Soil_N Replicate Difference
   <chr>     <chr>      <dbl>     <int>      <dbl>
 1 optimal   High          50         1       34.5
 2 optimal   High          45         2       39.5
 3 optimal   High          49         3       35.5
 4 optimal   High          48         4       36.5
 5 optimal   High          49         5       35.5
 6 40        High          69         1       15.5
 7 40        High          68         2       16.5
 8 40        High          69         3       15.5
 9 40        High          70         4       14.5
10 40        High          67         5       17.5

You can also add mean_optimal$MeanOptimal to interim_df as a new column with interim_df %>% mutate(MeanOptimal = mean_optimal$MeanOptimal) and then compute the difference from that column.
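
Note that the 84.5 above averages the "optimal" rows of both Diversity levels together. If the optimal mean should instead be taken separately within each Diversity level (an assumption on my part, the question does not say so explicitly), a possible sketch is to group by Diversity only. It also adds the standardized difference asked about in the question, read here as (optimal mean - value) / optimal mean. The df is rebuilt compactly with rep(), and the column name StdDifference is illustrative.

```r
library(dplyr)

# Same data as in the question, rebuilt compactly (an assumption for brevity)
df <- data.frame(
  Soilwater = rep(rep(c("optimal", "40", "30", "20", "10"), each = 5), times = 2),
  Diversity = rep(c("High", "Low"), each = 25),
  Soil_N = c(50, 45, 49, 48, 49, 69, 68, 69, 70, 67, 79, 78, 79, 78, 77,
             89, 89, 87, 88, 89, 99, 98, 97, 98, 98,
             120, 121, 121, 120, 122, 134, 131, 132, 134, 131,
             145, 148, 149, 147, 148, 159, 159, 157, 156, 157,
             169, 167, 167, 168, 164)
)

result <- df %>%
  group_by(Diversity) %>%
  mutate(
    # Mean of the five "optimal" replicates within this Diversity level
    MeanOptimal   = mean(Soil_N[Soilwater == "optimal"]),
    # Unstandardized difference: optimal mean minus each replicate
    Difference    = MeanOptimal - Soil_N,
    # Standardized difference: relative to the optimal mean
    StdDifference = Difference / MeanOptimal
  ) %>%
  ungroup() %>%
  # Drop the optimal rows themselves; only the treatment levels remain
  filter(Soilwater != "optimal")

result
```

From here, group_by(Soilwater, Diversity) followed by summarise() would give the average difference per treatment level, as in the original pipeline.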
