Reputation: 23
            VZ.Close CBOU.Close SBUX.Close   T.Close
VZ.Close   1.0000000  0.5804478  0.8872978 0.9480894
CBOU.Close 0.5804478  1.0000000  0.7876277 0.4988890
SBUX.Close 0.8872978  0.7876277  1.0000000 0.8143305
T.Close    0.9480894  0.4988890  0.8143305 1.0000000
Let's say I have these correlations between stock prices. I would like to start with the first row and find the pair with the highest correlation, which would be VZ and T. I then want to remove those two stocks as options and, among the remaining stocks, find the pair with the highest correlation, and so on until all stocks are paired. In this example the second pair is obviously CBOU and SBUX because they are the only two left, but I want the code to accommodate any number of pairs.
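A minimal sketch of the procedure described above, read literally: take the first stock that is still unpaired, match it with its most correlated remaining stock, remove both, and repeat. The helper name `pair_by_row` is hypothetical (it is not from any package), and the sketch assumes the matrix has identical row and column names. With an odd number of stocks the last one is simply left unpaired.

```r
# Correlation matrix copied from the question (".Close" suffix dropped for brevity).
stocks <- c("VZ", "CBOU", "SBUX", "T")
cors <- matrix(c(1.0000000, 0.5804478, 0.8872978, 0.9480894,
                 0.5804478, 1.0000000, 0.7876277, 0.4988890,
                 0.8872978, 0.7876277, 1.0000000, 0.8143305,
                 0.9480894, 0.4988890, 0.8143305, 1.0000000),
               nrow = 4, byrow = TRUE, dimnames = list(stocks, stocks))

# pair_by_row: hypothetical helper, assumes rownames(m) == colnames(m).
pair_by_row <- function(m) {
  remaining <- rownames(m)
  pairs <- list()
  while (length(remaining) >= 2) {
    first  <- remaining[1]
    others <- remaining[-1]
    # Best remaining partner for the first unpaired stock
    partner <- others[which.max(m[first, others])]
    pairs[[length(pairs) + 1]] <- data.frame(v1 = first, v2 = partner,
                                             cor = m[first, partner],
                                             stringsAsFactors = FALSE)
    # Remove both stocks from the pool
    remaining <- setdiff(remaining, c(first, partner))
  }
  do.call(rbind, pairs)
}

pairs <- pair_by_row(cors)
pairs
```

On the matrix from the question this pairs VZ with T first and then CBOU with SBUX.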
Upvotes: 2
Views: 5312
Reputation: 59970
I think this answers your question, but I can't be sure as the original question is a bit ambiguous...
# Construct a toy example of a symmetric matrix
# nc is number of rows/columns in matrix, in the problem above it was 4, but let's try with 6
nc <- 6
mat <- diag( 1 , nc )
# Create toy correlation data for matrix
dat <- runif( ( (nc^2-nc)/2 ) )
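# Fill both triangles of the matrix.
# Note: both assignments fill column by column, so the result is NOT exactly
# symmetric (see the output below). For a truly symmetric matrix, replace the
# second line with: mat[upper.tri( mat ) ] <- t( mat )[upper.tri( mat ) ]
mat[lower.tri( mat ) ] <- dat
mat[upper.tri( mat ) ] <- dat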
# Create vector of random string names for row/column names
names <- replicate( nc , expr = paste( sample( c( letters , LETTERS ) , 3 , replace = TRUE ) , collapse = "" ) )
dimnames(mat) <- list( names , names )
# Sanity check
mat
      SXK   llq   xFL   RVW   oYQ   Seb
SXK 1.000 0.973 0.499 0.585 0.813 0.751
llq 0.973 1.000 0.075 0.533 0.794 0.826
xFL 0.499 0.099 1.000 0.099 0.481 0.968
RVW 0.075 0.813 0.620 1.000 0.620 0.307
oYQ 0.585 0.794 0.751 0.968 1.000 0.682
Seb 0.533 0.481 0.826 0.307 0.682 1.000
# OK, on to the problem at hand: you can substitute your own matrix into these lines.
# The diagonal of a correlation matrix is always 1, so it is excluded as per your problem
diag( mat ) <- NA
# Now find the next highest correlation in each row and set this to NA
mat <- t( apply( mat , 1 , function(x) { x[ which.max(x) ] <- NA ; return(x) } ) )
# Another sanity check...!
mat
      SXK   llq   xFL   RVW   oYQ   Seb
SXK    NA    NA 0.499 0.585 0.813 0.751
llq    NA    NA 0.075 0.533 0.794 0.826
xFL 0.499 0.099    NA 0.099 0.481    NA
RVW 0.075    NA 0.620    NA 0.620 0.307
oYQ 0.585 0.794 0.751    NA    NA 0.682
Seb 0.533 0.481    NA 0.307 0.682    NA
# Now return the two remaining columns with greatest correlation in that row
res <- t( apply( mat , 1 , function(x) { y <- names( sort(x , TRUE ) )[1:2] ; return( y ) } ) )
res
    [,1]  [,2]
SXK "oYQ" "Seb"
llq "Seb" "oYQ"
xFL "SXK" "oYQ"
RVW "xFL" "oYQ"
oYQ "llq" "xFL"
Seb "oYQ" "SXK"
Does this answer your question?
Upvotes: 0
Reputation: 49033
Here is a solution if you want to pick the maximum correlation at each step. The first step therefore looks at the whole matrix, not only at the first row.
Sample data:
d <- matrix(runif(36),ncol=6,nrow=6)
rownames(d) <- colnames(d) <- LETTERS[1:6]
diag(d) <- 1
d
           A          B         C          D         E          F
A 1.00000000 0.65209204 0.8520392 0.26980214 0.5844000 0.69335143
B 0.73531603 1.00000000 0.5499431 0.60511580 0.7483990 0.14788134
C 0.56433218 0.27242769 1.0000000 0.07952776 0.2147628 0.03711562
D 0.91756919 0.04853523 0.5554490 1.00000000 0.4344089 0.23381447
E 0.06897889 0.80740821 0.7974340 0.87425643 1.0000000 0.74546072
F 0.19961474 0.61665231 0.2829632 0.58110694 0.7433924 1.00000000
And the code:
results <- data.frame(v1=character(0), v2=character(0), cor=numeric(0), stringsAsFactors=FALSE)
diag(d) <- 0
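# Loop while more than one positive entry remains.
# 0 marks "already used", which assumes all correlations are positive.
while (sum(d>0)>1) {
  maxval <- max(d)
  # Row/column index of the first cell holding the current maximum
  # (named idx rather than max to avoid shadowing base::max)
  idx <- which(d==maxval, arr.ind=TRUE)[1,]
  results <- rbind(results, data.frame(v1=rownames(d)[idx[1]], v2=colnames(d)[idx[2]], cor=maxval))
  # Zero out both variables, as rows and as columns, so they cannot be paired again
  d[idx[1],] <- 0
  d[,idx[1]] <- 0
  d[idx[2],] <- 0
  d[,idx[2]] <- 0
}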
Which gives:
  v1 v2       cor
1  D  A 0.9175692
2  E  B 0.8074082
3  F  C 0.2829632
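The loop above uses 0 to mark variables that are already paired, so it stops early if the correlations are zero or negative. A minimal sketch of the same greedy idea with `NA` as the marker instead, which should also cope with negative correlations. The function name `greedy_pairs` and the toy matrix `neg` are assumptions made for illustration only.

```r
# greedy_pairs: hypothetical helper implementing the whole-matrix greedy pairing.
greedy_pairs <- function(m) {
  diag(m) <- NA                      # never pair a variable with itself
  out <- list()
  while (sum(!is.na(m)) > 0) {
    # Position of the first cell holding the current maximum
    idx <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
    out[[length(out) + 1]] <- data.frame(v1 = rownames(m)[idx[1]],
                                         v2 = colnames(m)[idx[2]],
                                         cor = m[idx[1], idx[2]],
                                         stringsAsFactors = FALSE)
    m[idx, ] <- NA                   # drop both variables as rows
    m[, idx] <- NA                   # and as columns
  }
  do.call(rbind, out)
}

# Toy symmetric matrix with only negative off-diagonal correlations (made-up values).
vars <- c("a", "b", "c", "d")
neg <- matrix(c( 1.0, -0.1, -0.8, -0.3,
                -0.1,  1.0, -0.4, -0.9,
                -0.8, -0.4,  1.0, -0.2,
                -0.3, -0.9, -0.2,  1.0),
              nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

res <- greedy_pairs(neg)
res
```

Here the least negative pair (a and b) is picked first, then c and d. With an odd number of variables, the last one is left unpaired.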
Upvotes: 4