chipsin

Reputation: 675

Loop for calculating correlation coefficient for selected groups

I have numerous data frames, each with numerous variable pairs for which I would like to calculate the correlation coefficient. I have been able to write a function and an mapply call to automate this process, but it only works for the entire dataset, not for a defined subset of it.

Below is an example dataset.

library(dplyr)
df <- data.frame("ID" = 1:16)
df$VarA <- c(1,1,1,1,1,1,1,1,1,1,1,14,14,14,14,16)
df$VarB <- c(10,0,0,0,12,12,12,12,0,14,14,14,16,16,16,16)
df$VarC <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$New_VarA <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$New_VarB <- c(10,0,0,0,12,12,12,12,0,14,14,14,16,16,16,16)
df$New_VarC <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$New_VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$ControlVarA <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
                           "Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
                           "Group_2","Group_2","Group_2","Group_2")) 
df$ControlVarB <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
                           "Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
                           "Group_2","Group_2","Group_2","Group_2")) 
df$ControlVarC <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
                           "Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
                           "Group_2","Group_2","Group_2","Group_2")) 
df$ControlVarD <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
                           "Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
                           "Group_2","Group_2","Group_2","Group_2")) 

I have written a function to calculate R2, and a wrapper that applies it to a pair of columns in a data frame (which I refer to as R2_df), using the code below. I then define the variable lists and use mapply to successfully calculate the R2 value for each pair of variables.

R2 = function(y_actual,y_predict){
  cor(y_actual,y_predict)^2
}

R2_df <- function(dataset, x, y){
  R2(dataset[[x]], dataset[[y]])
}

Var_list <- df %>% select(starts_with("Var")) %>% colnames()
New_Var_list <- df %>% select(starts_with("New")) %>% colnames()
Control_list <- df %>% select(starts_with("ControlVar")) %>% colnames()

mapply(R2_df, Var_list, New_Var_list, MoreArgs = list(dataset = df))

As I am only interested in obtaining the correlation coefficients for "Group_2", I have modified the R2_df function to filter the data frame on "Group_2" first. The updated code is provided below. It runs without errors, but all of the R2 values are now NA.

R2 = function(y_actual,y_predict){
  cor(y_actual,y_predict)^2
}

R2_df <- function(dataset, x, y, z){
  
  r2_dataset <- dataset %>% filter(z == "Group_2")
  
  R2(r2_dataset[[x]], r2_dataset[[y]])
  
}
    
Var_list <- df %>% select(starts_with("Var")) %>% colnames()
New_Var_list <- df %>% select(starts_with("New")) %>% colnames()
Control_list <- df %>% select(starts_with("ControlVar")) %>% colnames()

mapply(R2_df, Var_list, New_Var_list, Control_list, MoreArgs = list(dataset = df))

Upvotes: 0

Views: 220

Answers (1)

Björn

Reputation: 1822

Updated Answer, after Clarification in Comments

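The problem is the filter(z == "Group_2") call. z is not replaced by the column whose name it holds (ControlVarA in the first case). Instead, dplyr looks for a column called z in your data.frame, finds none, and falls back to the function argument z, which is just the string "ControlVarA". The comparison "ControlVarA" == "Group_2" is FALSE, so every row is dropped and cor() returns NA.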
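A minimal sketch of that behaviour, assuming a small made-up data frame (toy, val and grp are hypothetical names, not taken from the question):

```r
library(dplyr)

# Hypothetical toy data, not the question's df
toy <- data.frame(
  val = c(1, 2, 3, 4),
  grp = factor(c("Group_1", "Group_1", "Group_2", "Group_2"))
)

z <- "grp"  # a column name stored as a string, as mapply would pass it

# toy has no column named z, so filter() compares the string "grp"
# itself with "Group_2"; that is FALSE, and every row is dropped
no_rows <- toy %>% filter(z == "Group_2")
nrow(no_rows)  # 0

# With nothing left to correlate, cor() gives NA, as in the question
empty_cor <- cor(no_rows$val, no_rows$val)
is.na(empty_cor)  # TRUE
```

No error is raised at any point, which is why the NA values appear silently.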

You can solve that by using the column name stored in z to index the data passed along by the %>% pipe, which is available as the dot (.), e.g.:
r2_dataset <- dataset %>% filter(.[z] == "Group_2")

library(dplyr)

R2 = function(y_actual,y_predict){
  cor(y_actual,y_predict)^2
}

# x, y and z are column names passed as strings; dataset is the data frame
R2_df <- function(x, y, z, dataset){
  r2_dataset <- dataset %>% filter(.[z] == "Group_2")
  R2(r2_dataset[[x]], r2_dataset[[y]])
}

Var_list <- df %>% select(starts_with("Var")) %>% colnames()
New_Var_list <- df %>% select(starts_with("New")) %>% colnames()
Control_list <- df %>% select(starts_with("ControlVar")) %>% colnames()

mapply(R2_df, Var_list, New_Var_list, Control_list, MoreArgs = list(dataset = df))

Output

      VarA       VarB       VarC       VarD 
0.01513781 1.00000000 1.00000000 1.00000000
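
As a possible variation, the same lookup can be written with dplyr's .data pronoun. I believe .[z] yields a one-column logical matrix, which newer dplyr versions may warn about, whereas .data[[z]] yields a plain vector. The sketch below is self-contained and uses a made-up data frame; toy, r2_by_group and the level argument are hypothetical names added for illustration, not part of the question:

```r
library(dplyr)

# Hypothetical toy data: within group "b", x_new is a linear function of x
toy <- data.frame(
  x     = c(1, 2, 3, 4, 5, 6),
  x_new = c(6, 1, 4, 2, 4, 6),
  grp   = factor(c("a", "a", "a", "b", "b", "b"))
)

# x, y and z are column names given as strings; level is the group to keep
r2_by_group <- function(x, y, z, dataset, level) {
  # .data[[z]] looks up the column whose name is stored in z
  subset_df <- dataset %>% filter(.data[[z]] == level)
  cor(subset_df[[x]], subset_df[[y]])^2
}

res <- mapply(r2_by_group, "x", "x_new", "grp",
              MoreArgs = list(dataset = toy, level = "b"))
res
```

Passing the group label as an argument, rather than hard-coding "Group_2", also lets the same function be reused for other groups.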

Upvotes: 1
