Reputation: 675
I have numerous data frames with numerous variable pairs, for which I would like to calculate the correlation coefficient. I have been able to write a function and an mapply call to automate this process, but this only works for the entire dataset, not a defined subset of the dataset.
Below is an example dataset.
library(dplyr)
df <- data.frame("ID" = 1:16)
df$VarA <- c(1,1,1,1,1,1,1,1,1,1,1,14,14,14,14,16)
df$VarB <- c(10,0,0,0,12,12,12,12,0,14,14,14,16,16,16,16)
df$VarC <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$New_VarA <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$New_VarB <- c(10,0,0,0,12,12,12,12,0,14,14,14,16,16,16,16)
df$New_VarC <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$New_VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$ControlVarA <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2"))
df$ControlVarB <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2"))
df$ControlVarC <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2"))
df$ControlVarD <- factor(c("Group_1","Group_1","Group_1","Group_1","Group_1", "Group_1",
"Group_2","Group_2","Group_2","Group_2","Group_2","Group_2",
"Group_2","Group_2","Group_2","Group_2"))
I have written a function to calculate R2, and a function for calculating R2 across variable pairs (which I refer to as R2_df), using the code below. I then define the variable lists and use the mapply function to successfully calculate the R2 values for each pair of variables.
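# Squared Pearson correlation between two numeric vectors
R2 = function(y_actual, y_predict) {
  cor(y_actual, y_predict)^2
}
# R2 for two columns of a data frame, given their names as strings
R2_df <- function(dataset, x, y) {
  R2(dataset[[x]], dataset[[y]])
}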
Var_list <- df %>% select(starts_with("Var")) %>% colnames()
New_Var_list <- df %>% select(starts_with("New")) %>% colnames()
Control_list <- df %>% select(starts_with("ControlVar")) %>% colnames()
mapply(R2_df, Var_list, New_Var_list, MoreArgs = list(dataset = df))
As I am only interested in obtaining the correlation coefficients for "Group_2", I have modified the R2_df function to filter the data frame on the presence of "Group_2". This updated code is provided below. I do not receive any errors when I run this updated code, but all of the R2 values are now NA.
R2 = function(y_actual, y_predict) {
  cor(y_actual, y_predict)^2
}
R2_df <- function(dataset, x, y, z) {
  r2_dataset <- dataset %>% filter(z == "Group_2")
  R2(r2_dataset[[x]], r2_dataset[[y]])
}
Var_list <- df %>% select(starts_with("Var")) %>% colnames()
New_Var_list <- df %>% select(starts_with("New")) %>% colnames()
Control_list <- df %>% select(starts_with("ControlVar")) %>% colnames()
mapply(R2_df, Var_list, New_Var_list, Control_list, MoreArgs = list(dataset = df))
Upvotes: 0
Views: 220
Reputation: 1822
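The problem is the filter(z == "Group_2") call. z is not replaced by the column whose name it contains (in the first case ControlVarA). Instead, dplyr looks for a column called z in your data.frame; there is none, so it falls back to the function argument z itself, which is just a character string. The comparison "ControlVarA" == "Group_2" is FALSE, so the filter keeps no rows, which matches the NA values you see.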
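A minimal sketch of that behaviour, using a hypothetical toy data frame (toy, value and grp are made-up names, not part of the question's data):

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- data.frame(
  value = c(1, 2, 3, 4),
  grp   = c("Group_1", "Group_1", "Group_2", "Group_2")
)

z <- "grp"  # a column name stored as a character string

# The bare z is just the string "grp", so this is a single FALSE
bare_test <- z == "Group_2"

# Looking the column up by name gives one TRUE/FALSE per row
named_test <- toy[[z]] == "Group_2"

# filter() with the bare z therefore keeps no rows at all
n_bare <- nrow(filter(toy, z == "Group_2"))

# filter() with the looked-up column keeps the two Group_2 rows
n_named <- nrow(filter(toy, toy[[z]] == "Group_2"))
```

Comparing n_bare with n_named shows that the first filter silently returns an empty data frame instead of raising an error.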
You can solve that by telling filter to look up the column whose name is stored in z. Index the data passed along by the %>% pipe, which is available as ., with [], e.g.:
r2_dataset <- dataset %>% filter(.[z] == "Group_2")
library(dplyr)
# Squared Pearson correlation between two numeric vectors
R2 = function(y_actual, y_predict) {
  cor(y_actual, y_predict)^2
}
# x, y and z are column names passed as strings; z names the grouping column
R2_df <- function(x, y, z, dataset) {
  r2_dataset <- dataset %>% filter(.[z] == "Group_2")
  R2(r2_dataset[[x]], r2_dataset[[y]])
}
Var_list <- df %>% select(starts_with("Var")) %>% colnames()
New_Var_list <- df %>% select(starts_with("New")) %>% colnames()
Control_list <- df %>% select(starts_with("ControlVar")) %>% colnames()
mapply(R2_df, Var_list, New_Var_list, Control_list, MoreArgs = list(dataset = df))
VarA VarB VarC VarD
0.01513781 1.00000000 1.00000000 1.00000000
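As a possible alternative, the same lookup can be written with the .data pronoun that dplyr provides for programming with column names held in strings. This is only a sketch: the group argument and the small toy data frame are assumptions added for illustration, and newer dplyr versions may warn about the one-column matrix that .[z] == "Group_2" produces, which this form avoids.

```r
library(dplyr)

# Squared Pearson correlation between two numeric vectors
R2 <- function(y_actual, y_predict) {
  cor(y_actual, y_predict)^2
}

# Variant of R2_df; `group` is a hypothetical extra argument so the
# level to keep is not hard-coded
R2_df <- function(x, y, z, dataset, group = "Group_2") {
  # .data[[z]] refers to the column whose name is stored in z
  r2_dataset <- dataset %>% filter(.data[[z]] == group)
  R2(r2_dataset[[x]], r2_dataset[[y]])
}

# Hypothetical toy data, not the data frame from the question
groups <- c("Group_1", "Group_1", "Group_1", "Group_2", "Group_2", "Group_2")
toy <- data.frame(
  VarA        = c(1, 2, 3, 4, 5, 6),
  New_VarA    = c(2, 4, 6, 8, 10, 12),
  VarB        = c(1, 2, 3, 1, 2, 3),
  New_VarB    = c(3, 2, 1, 3, 2, 1),
  ControlVarA = groups,
  ControlVarB = groups
)

# Element-wise pairing: (VarA, New_VarA, ControlVarA), (VarB, New_VarB, ControlVarB)
res <- mapply(R2_df,
              c("VarA", "VarB"),
              c("New_VarA", "New_VarB"),
              c("ControlVarA", "ControlVarB"),
              MoreArgs = list(dataset = toy))
```

Within Group_2 of this toy data both pairs are perfectly linearly related (one positively, one negatively), so both entries of res should come out as 1.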
Upvotes: 1