Reputation: 37
I have a dataset with 5 levels of a treatment for a response variable.
Assume I measured soil N content at 5 levels of soil water content (optimal, 40%, 30%, 20%, and 10%), with 5 replicates per level.
Now I would like to calculate, for each replicate, the unstandardized differences (optimal - 40%, optimal - 30%, optimal - 20%, optimal - 10%) and the standardized differences ((optimal - 40%) / optimal, (optimal - 30%) / optimal, and so on).
Is there any way to do this in R with the tidyverse? I am having trouble writing loops for it.
df<- data.frame(Soilwater = c("optimal", "optimal", "optimal", "optimal", "optimal",
"40", "40", "40", "40", "40",
"30","30","30","30","30",
"20", "20","20","20","20",
"10","10","10","10","10",
"optimal", "optimal", "optimal", "optimal", "optimal",
"40", "40", "40", "40", "40",
"30","30","30","30","30",
"20", "20","20","20","20",
"10","10","10","10","10"),
Diversity = c("High","High","High","High","High","High","High","High","High","High", "High","High","High","High","High","High","High","High","High","High",
"High","High","High","High","High",
"Low", "Low", "Low","Low","Low","Low","Low","Low","Low","Low",
"Low","Low","Low","Low","Low","Low","Low","Low","Low","Low",
"Low","Low","Low","Low","Low"),
Soil_N = c(50,45, 49, 48, 49, 69, 68, 69, 70, 67, 79, 78, 79, 78, 77, 89, 89, 87, 88, 89, 99, 98, 97, 98, 98, 120,
121, 121, 120, 122, 134, 131, 132, 134, 131, 145, 148, 149, 147,
148, 159, 159, 157, 156, 157, 169, 167, 167, 168, 164))
I used the code below that was suggested by @JonSpring which was really helpful.
df %>%
# First, we can add a `Replicate` number based on position within
# each Soilwater/Diversity cohort.
group_by(Soilwater, Diversity) %>%
mutate(Replicate = row_number()) %>%
# Calc diff vs. experiment with same Diversity & Replicate, optimal Soilwater
group_by(Diversity, Replicate) %>%
mutate(Difference = Soil_N - Soil_N[Soilwater == "optimal"]) %>%
# Summarize avg diffs
group_by(Soilwater, Diversity) %>%
summarize(Mean_Diff = mean(Difference), .groups = "drop")
However, I realized that I first need to average the optimal Soilwater level and then calculate the difference between this average and each replicate from the other Soilwater levels. I tried the code below (using the mean function to average the optimal Soilwater values before taking the difference), but it is not working.
df%>% group_by(Soilwater, Diversity)%>% mutate(Replicate = row_number())%>%
group_by(Diversity, Replicate)%>% mutate(Difference = mean(Soil_N[Soilwater=="optimal"])- Soil_N)
Upvotes: 0
Views: 175
Reputation: 2288
It is difficult to understand your problem, so I will start from what seems to work for you.
As a newcomer to R and tidyverse, please be cognisant that the %>% (pipe) chains your operations on the (starting) object.
You can assign any state/stage of your operations to a new object (aka variable).
I further recommend that you create several "interim" objects as you work through a problem, to store the individual steps of your algorithm. This will give you a better feel for what you have. Over time you will get enough experience to chain the operations and skip some of the interim objects.
For that purpose, I introduce an "interim" object here, since your description suggests the code worked for you up to that point, i.e. I assign interim_df <- ...
library(dplyr)
interim_df <- df %>%
group_by(Soilwater, Diversity) %>%
mutate(Replicate = row_number()) %>%
group_by(Diversity, Replicate)
This yields an object interim_df. Let's have a look at it:
interim_df
# A tibble: 50 x 4
# Groups: Diversity, Replicate [10]
Soilwater Diversity Soil_N Replicate
<chr> <chr> <dbl> <int>
1 optimal High 50 1
2 optimal High 45 2
3 optimal High 49 3
4 optimal High 48 4
5 optimal High 49 5
6 40 High 69 1
7 40 High 68 2
8 40 High 69 3
9 40 High 70 4
10 40 High 67 5
Ok. We got a tibble of 50 rows with 4 variables, which seems to be the data structure you are happy with.
What you also have is a "grouped dataframe". Be sure to ungroup() when you want to operate on the whole dataframe (or on a different part of it).
interim_df <- interim_df %>% ungroup()
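Why the grouping matters here can be shown with a minimal sketch. It rebuilds the question's data compactly with rep() (that shortcut and the helper names grouped, by_group, overall and OptMean are mine, not from the question). While the data is still grouped by Diversity and Replicate, each group holds exactly one "optimal" row, so mean() returns that single value; after ungroup() it averages all ten "optimal" rows.

```r
library(dplyr)

# Same values as the question's df, built compactly with rep()
df <- data.frame(
  Soilwater = rep(rep(c("optimal", "40", "30", "20", "10"), each = 5), times = 2),
  Diversity = rep(c("High", "Low"), each = 25),
  Soil_N = c(50, 45, 49, 48, 49, 69, 68, 69, 70, 67, 79, 78, 79, 78, 77,
             89, 89, 87, 88, 89, 99, 98, 97, 98, 98,
             120, 121, 121, 120, 122, 134, 131, 132, 134, 131,
             145, 148, 149, 147, 148, 159, 159, 157, 156, 157,
             169, 167, 167, 168, 164)
)

grouped <- df %>%
  group_by(Soilwater, Diversity) %>%
  mutate(Replicate = row_number()) %>%
  group_by(Diversity, Replicate)

# Still grouped: one "optimal" row per group, so the mean is that row's value
by_group <- grouped %>%
  mutate(OptMean = mean(Soil_N[Soilwater == "optimal"])) %>%
  ungroup()

# Ungrouped first: the mean runs over all ten "optimal" rows
overall <- grouped %>%
  ungroup() %>%
  mutate(OptMean = mean(Soil_N[Soilwater == "optimal"]))

n_distinct(by_group$OptMean)  # more than one value
n_distinct(overall$OptMean)   # exactly one value
```

This is most likely why the attempt in the question appeared not to work: the mean was taken inside groups that each contain only one optimal measurement.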
You can "extract" your "optimal" measurements and calculate the average over this "new" df/tibble.
mean_optimal <- interim_df %>%
filter(Soilwater == "optimal") %>%
summarise(MeanOptimal = mean(Soil_N)) # we calculate/summarise the mean over the part we want
This gives you
# A tibble: 1 x 1
MeanOptimal
<dbl>
1 84.5
To be clear, we now have another tibble with 1 variable/column.
This can be used with your interim_df. However, make sure you understand how to "extract" a column from a tibble (i.e. turn it into a vector you can reuse). The base-R notation $ gives you direct access to a column (vector); tidyverse offers the pull() function.
final <- interim_df %>% mutate(Difference = mean_optimal$MeanOptimal - Soil_N)
final
# A tibble: 50 x 5
Soilwater Diversity Soil_N Replicate Difference
<chr> <chr> <dbl> <int> <dbl>
1 optimal High 50 1 34.5
2 optimal High 45 2 39.5
3 optimal High 49 3 35.5
4 optimal High 48 4 36.5
5 optimal High 49 5 35.5
6 40 High 69 1 15.5
7 40 High 68 2 16.5
8 40 High 69 3 15.5
9 40 High 70 4 14.5
10 40 High 67 5 17.5
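For comparison, the pull() route mentioned above could look like the sketch below. The data is the question's df rebuilt with rep(), and the names opt and final_pull are only illustrative. The result should match the $ version.

```r
library(dplyr)

# Same values as the question's df, built compactly with rep()
df <- data.frame(
  Soilwater = rep(rep(c("optimal", "40", "30", "20", "10"), each = 5), times = 2),
  Diversity = rep(c("High", "Low"), each = 25),
  Soil_N = c(50, 45, 49, 48, 49, 69, 68, 69, 70, 67, 79, 78, 79, 78, 77,
             89, 89, 87, 88, 89, 99, 98, 97, 98, 98,
             120, 121, 121, 120, 122, 134, 131, 132, 134, 131,
             145, 148, 149, 147, 148, 159, 159, 157, 156, 157,
             169, 167, 167, 168, 164)
)

# pull() turns the single summary column into a plain numeric vector
opt <- df %>%
  filter(Soilwater == "optimal") %>%
  summarise(MeanOptimal = mean(Soil_N)) %>%
  pull(MeanOptimal)

# The length-1 vector is recycled across all rows in mutate()
final_pull <- df %>%
  mutate(Difference = opt - Soil_N)
```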
You can also add mean_optimal$MeanOptimal to interim_df as a new column with interim_df %>% mutate(MeanOptimal = mean_optimal$MeanOptimal) and then compute the difference.
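Note that the 84.5 above pools the optimal measurements of both Diversity levels. If the reference mean should instead be computed separately for High and Low Diversity (an assumption on my part, the question does not say so explicitly), a sketch along these lines might be closer to what you need. It also adds the standardized difference, defined as in the question as (optimal - level) / optimal. The column name Std_Difference and the rep()-based data construction are mine.

```r
library(dplyr)

# Same values as the question's df, built compactly with rep()
df <- data.frame(
  Soilwater = rep(rep(c("optimal", "40", "30", "20", "10"), each = 5), times = 2),
  Diversity = rep(c("High", "Low"), each = 25),
  Soil_N = c(50, 45, 49, 48, 49, 69, 68, 69, 70, 67, 79, 78, 79, 78, 77,
             89, 89, 87, 88, 89, 99, 98, 97, 98, 98,
             120, 121, 121, 120, 122, 134, 131, 132, 134, 131,
             145, 148, 149, 147, 148, 159, 159, 157, 156, 157,
             169, 167, 167, 168, 164)
)

result <- df %>%
  group_by(Diversity) %>%                       # one reference mean per Diversity level
  mutate(
    MeanOptimal    = mean(Soil_N[Soilwater == "optimal"]),
    Difference     = MeanOptimal - Soil_N,      # unstandardized: optimal - level
    Std_Difference = Difference / MeanOptimal   # standardized: (optimal - level) / optimal
  ) %>%
  ungroup()
```

Grouping by Diversity only (not by Replicate) is the design choice that matters: each group then contains all five optimal replicates, so the mean is a real average rather than a single value.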
Upvotes: 3