PeterG
PeterG

Reputation: 23

Simplify R code for computing group statistics across multiple factors

I am trying to compute the variance for each group in a data set with multiple factors. For example, the data set below is the first 6 lines of a data frame with 5 columns: 4 factors of two levels each (No and Yes) and 1 continuous variable:

Factor A Factor B Factor C Factor D VarX
Yes Yes Yes No 66.8
No Yes Yes No 66.0
Yes No No No 58.4
No Yes Yes Yes 68.3
Yes Yes Yes No 61.8
Yes No No No 67.3

What I want to do is produce a summary table such as the one below:

Factor SD (NO) SD (YES) SD Ratio
Factor A 3.79 3.51 1.08
Factor B 3.44 3.83 1.11
Factor C 3.77 3.53 1.07
Factor D 3.92 3.32 1.18

For each factor, I have calculated the standard deviation at each level ("No" and "Yes") as well as the ratio of the two standard deviations.

Here is the code I am using to do this:

#
# Define modify function for SD ratio column
#
sd_ratio<-function(x,y){
  return(max(x,y)/min(x,y))
}
#
# Set up storage
#
nc<-4 # number of factors in data
testDataSum<-tibble(SD_No=rep(NA,nc),
                   SD_Yes=rep(NA,nc),
                   SD_Ratio=rep(NA,nc))
#
Factor<-vector("list",4)
SDList<-vector("list",4)
#
# For Loop. Group data by factors 1,2,3,4
#
for (i in 1:4){
  Factor[[i]]<-names(testData[,i])
  SDList[[i]]<-testData %>% 
  group_by(testData[,i])%>%
  summarize(SD=sd(VarX))
}
# Load summary DF with data by unlisting SDList
#
testDataSum$SD_No<-as.vector(matrix(unlist(SDList),ncol=4,byrow=T)[,3])
testDataSum$SD_Yes<-as.vector(matrix(unlist(SDList),ncol=4,byrow=T)[,4])
testDataSum$SD_Ratio=modify2(testDataSum$SD_No,testDataSum$SD_Yes,sd_ratio)
#
# Load formatted factor names and put it at the front
#
testDataSum<-testDataSum %>% 
  mutate(Factor=unlist(Factor)) %>%
  relocate(Factor)
# Show results
testDataSum

My request is for help in simplifying this code. This works but it seems horribly ugly and complex, not to mention difficult to come back to at a later date and modify. I believe there is a much simpler way to do it without a for-loop, and without the ungainly process of unlisting SDList using the "as.vector (matrix (..." lines. I have reviewed the documentation for DPLYR and PLYR, especially the grouping section, but I am baffled. Any suggestions are much appreciated.

Here is a link to a github repository with the code and a csv file with 192 rows that you can use to produce the result table.

Git Hub Link for code and Data

Upvotes: 0

Views: 52

Answers (1)

Kra.P
Kra.P

Reputation: 15143

You may try using reshape2, dplyr, and tidyr

When I read your data, column names get broken, so I rename them beforehand.

library(dplyr)
library(tidyr)
library(reshape2)

names(df) <- c("A","B","C","D","VarX")


df %>%
  melt(id.vars = "VarX", variable.name = "Factor") %>%
  group_by(Factor, value) %>%
  summarize(sd = sd(VarX)) %>%
  pivot_wider(id_cols = Factor, values_from = sd, names_from = value, names_glue = "sd_{value}") %>%
  mutate(SD_ratio = pmax(sd_No,sd_Yes)/pmin(sd_No,sd_Yes))

  Factor sd_No sd_Yes SD_ratio
  <fct>  <dbl>  <dbl>    <dbl>
1 A       3.51   3.79     1.08
2 B       3.83   3.44     1.11
3 C       3.53   3.77     1.07
4 D       3.92   3.32     1.18

Upvotes: 1

Related Questions