Reputation: 23
I am trying to compute the variance for each group in a data set with multiple factors. For example, the data set below is the first 6 lines of a data frame with 5 columns: 4 factors of two levels each (No and Yes) and 1 continuous variable:
Factor A | Factor B | Factor C | Factor D | VarX |
---|---|---|---|---|
Yes | Yes | Yes | No | 66.8 |
No | Yes | Yes | No | 66.0 |
Yes | No | No | No | 58.4 |
No | Yes | Yes | Yes | 68.3 |
Yes | Yes | Yes | No | 61.8 |
Yes | No | No | No | 67.3 |
What I want to do is produce a summary table such as the one below:
Factor | SD (NO) | SD (YES) | SD Ratio |
---|---|---|---|
Factor A | 3.79 | 3.51 | 1.08 |
Factor B | 3.44 | 3.83 | 1.11 |
Factor C | 3.77 | 3.53 | 1.07 |
Factor D | 3.92 | 3.32 | 1.18 |
For each factor, I have calculated the standard deviation at each level ("No" and "Yes") as well as the ratio of the two standard deviations.
Here is the code I am using to do this:
#
# Define modify function for SD ratio column
#
sd_ratio<-function(x,y){
return(max(x,y)/min(x,y))
}
#
# Set up storage
#
nc<-4 # number of factors in data
testDataSum<-tibble(SD_No=rep(NA,nc),
SD_Yes=rep(NA,nc),
SD_Ratio=rep(NA,nc))
#
Factor<-vector("list",4)
SDList<-vector("list",4)
#
# For Loop. Group data by factors 1,2,3,4
#
for (i in 1:4){
Factor[[i]]<-names(testData[,i])
SDList[[i]]<-testData %>%
group_by(testData[,i])%>%
summarize(SD=sd(VarX))
}
# Load summary DF with data by unlisting SDList
#
testDataSum$SD_No<-as.vector(matrix(unlist(SDList),ncol=4,byrow=T)[,3])
testDataSum$SD_Yes<-as.vector(matrix(unlist(SDList),ncol=4,byrow=T)[,4])
testDataSum$SD_Ratio=modify2(testDataSum$SD_No,testDataSum$SD_Yes,sd_ratio)
#
# Load formatted factor names and put it at the front
#
testDataSum<-testDataSum %>%
mutate(Factor=unlist(Factor)) %>%
relocate(Factor)
# Show results
testDataSum
My request is for help in simplifying this code. This works but it seems horribly ugly and complex, not to mention difficult to come back to at a later date and modify. I believe there is a much simpler way to do it without a for-loop, and without the ungainly process of unlisting SDList using the "as.vector (matrix (..." lines. I have reviewed the documentation for DPLYR and PLYR, especially the grouping section, but I am baffled. Any suggestions are much appreciated.
Here is a link to a github repository with the code and a csv file with 192 rows that you can use to produce the result table.
Git Hub Link for code and Data
Upvotes: 0
Views: 52
Reputation: 15143
You may try using reshape2
, dplyr
, and tidyr
When I read your data, column names get broken, so I rename them beforehand.
library(dplyr)
library(tidyr)
library(reshape2)
names(df) <- c("A","B","C","D","VarX")
df %>%
melt(id.vars = "VarX", variable.name = "Factor") %>%
group_by(Factor, value) %>%
summarize(sd = sd(VarX)) %>%
pivot_wider(id_cols = Factor, values_from = sd, names_from = value, names_glue = "sd_{value}") %>%
mutate(SD_ratio = pmax(sd_No,sd_Yes)/pmin(sd_No,sd_Yes))
Factor sd_No sd_Yes SD_ratio
<fct> <dbl> <dbl> <dbl>
1 A 3.51 3.79 1.08
2 B 3.83 3.44 1.11
3 C 3.53 3.77 1.07
4 D 3.92 3.32 1.18
Upvotes: 1