I am trying to assess the stability of a correlation analysis by iteratively dropping a variable, and re-running the analysis.
As I understand it, this requires me to (1) create matrices with p-1 columns by iteratively/sequentially dropping one variable from a dataframe, (2) run a correlation function over the series of matrices, and (3) feed the output into a common dataframe or list for subsequent analysis.
I am able to achieve each of these steps manually, as follows:
#required library for cc function
library(CCA)
#set seed
set.seed(123)
#X and Y dataframes
X_df <- data.frame(replicate(4,sample(1:10,10,rep=TRUE)))
Y_df <- data.frame(replicate(3,sample(1:10,10,rep=TRUE)))
#X and Y as scaled matrices
X <- scale(X_df)
Y <- scale(Y_df)
#manually omit a variable/column from the X df
X1 <- scale(X_df[2:4])
X2 <- scale(X_df[c(1, 3:4)])
X3 <- scale(X_df[c(1:2, 4)])
X4 <- scale(X_df[1:3])
#manually omit a variable/column from the Y df
Y1 <- scale(Y_df[2:3])
Y2 <- scale(Y_df[c(1, 3)])
Y3 <- scale(Y_df[1:2])
#perform canonical correlation - X sets and Y
cX1 <- cc(X1,Y)$cor
cX2 <- cc(X2,Y)$cor
cX3 <- cc(X3,Y)$cor
cX4 <- cc(X4,Y)$cor
#perform canonical correlation - Y sets and X
cY1 <- cc(X,Y1)$cor
cY2 <- cc(X,Y2)$cor
cY3 <- cc(X,Y3)$cor
#get canonical correlation values into a df
XVALS <- as.data.frame(rbind(cX1, cX2, cX3, cX4))
YVALS <- as.data.frame(rbind(cY1, cY2, cY3))
Of course, I know it's very bad to do this manually, and my real data is much larger.
Unfortunately, I am pretty new to R (and coding) and have been struggling to achieve any of these steps in a better way. I am familiar with the (existence of) the apply functions and also some functions in dplyr that I think are likely relevant (e.g., select()), but I just can't get it to work despite reading documentation and seemingly similar posts for hours -- any guidance would be greatly appreciated.
Upvotes: 1
Views: 42
First of all, there is no need for scale: cc gives the same results on the unscaled dataframes, as the code below shows.
The reason why vectors are scaled is a variant of R FAQ 7.31 (floating-point accuracy); see also this SO post.
With older processors the precision loss was a real problem, leading to clearly wrong results. This is no longer true, at least not in the general case.
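A minimal illustration of FAQ 7.31, showing why exact comparison of floating-point results fails while all.equal's default tolerance absorbs the difference:

```r
# classic floating-point example: 0.1 + 0.2 is not stored exactly as 0.3
x <- 0.1 + 0.2
x == 0.3                                # FALSE: exact comparison fails
all.equal(x, 0.3)                       # TRUE: within default tolerance
abs(x - 0.3) < .Machine$double.eps^0.5  # TRUE: the tolerance all.equal uses
```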
#perform canonical correlation - original X sets and Y
cX1b <- cc(X_df[2:4], Y)$cor
cX2b <- cc(X_df[c(1, 3:4)], Y)$cor
cX3b <- cc(X_df[c(1:2, 4)], Y)$cor
cX4b <- cc(X_df[1:3], Y)$cor
XVALSb <- as.data.frame(rbind(cX1b, cX2b, cX3b, cX4b))
XVALS and XVALSb row names are different; make them equal in order to please all.equal().
row.names(XVALS) <- 1:4
row.names(XVALSb) <- 1:4
The results are not exactly equal but are within floating-point accuracy. In this case I'm testing equality with all.equal's default tolerance of .Machine$double.eps^0.5.
identical(XVALS, XVALSb)
#[1] FALSE
all.equal(XVALS, XVALSb)
#[1] TRUE
XVALS - XVALSb
# V1 V2 V3
#1 0.000000e+00 1.110223e-16 0.000000e+00
#2 -1.110223e-16 1.110223e-16 5.551115e-17
#3 1.110223e-16 -2.220446e-16 2.220446e-16
#4 1.110223e-16 4.440892e-16 1.110223e-16
To get all combinations of columns leaving one out, there is the function combn.
The function cc_df_one_out below first calls combn on each of its arguments, then uses apply to run an anonymous function computing CCA::cc on each set of column indices.
Note that the row order is not the same as in your posted example, since combn does not follow your order of column indices.
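For example, with four columns combn enumerates the three-column subsets like this (each column of the result is one subset; note the first subset drops column 4 and the last drops column 1):

```r
# all size-3 subsets of columns 1:4, one subset per column of the result
combn(4, 3)
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    1    2
# [2,]    2    2    3    3
# [3,]    3    4    4    4
```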
cc_df_one_out <- function(X, Y){
  # all column-index sets of size ncol(x) - 1, i.e. one leave-one-out subset per column
  f <- function(x) combn(ncol(x), ncol(x) - 1)
  X_inx <- f(X)
  Y_inx <- f(Y)
  # run cc() once per subset; t() puts one subset per row
  ccX <- t(apply(X_inx, 2, function(i) cc(X[, i], Y)$cor))
  ccY <- t(apply(Y_inx, 2, function(i) cc(X, Y[, i])$cor))
  list(XVALS = as.data.frame(ccX), YVALS = as.data.frame(ccY))
}
cc_df_one_out(X_df, Y_df)
#$XVALS
# V1 V2 V3
#1 0.8787169 0.6999526 0.5073979
#2 0.8922514 0.7244302 0.2979096
#3 0.8441566 0.7807032 0.3331449
#4 0.9059585 0.7371382 0.1344559
#
#$YVALS
# V1 V2
#1 0.8975949 0.7309265
#2 0.8484323 0.7488632
#3 0.8721945 0.7452478
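If you also want to record which variable each row corresponds to, one possible sketch (the dropped_X row-name scheme is just illustrative, not part of the code above): since each combn subset leaves out exactly one column, setdiff recovers the dropped index per subset.

```r
# recover, for each leave-one-out subset, which X column was dropped
X_inx <- combn(ncol(X_df), ncol(X_df) - 1)
omitted <- apply(X_inx, 2, function(i) setdiff(seq_len(ncol(X_df)), i))
omitted
# [1] 4 3 2 1

# assuming X_df, Y_df and cc_df_one_out from above are in scope:
res <- cc_df_one_out(X_df, Y_df)
row.names(res$XVALS) <- paste0("dropped_X", omitted)
```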
Upvotes: 0