Reputation: 131
I am struggling with a programming issue. The goal is to find the optimal mean aggregate of multiple columns, such that that aggregate column maximizes the correlation with another column.
As a toy example, consider the following data:
set.seed(123)
df <- cbind(data.frame(FIRM = rnorm(36, 0, 0.05)),
data.frame(matrix(rnorm(36 * 50, 0, 0.05), 36, 50)))
Thus: the goal is to find the optimal combination of the 50 "X" columns such that when taking rowmeans of these columns the correlation with the "FIRM" column is maximized. However, altough this is only a small example, there are already 50 factorial possible combinations.
Part of the issue is that the real dataset is much larger (i.e., around 20,000 "FIRM" optimizations, with over 5,000 "X" possible columns to aggregate per optimization). Up until now, I started with some stepwise optimizations, but given the size of the dataset this is very inefficient. I was hoping to get some insights in a better way of coding this issue.
So far I wrote some code that takes any linear combination and maximizes that correlation with the initial column. However, I would now like to adjust this, such that the code does not weigh the columns by any other weights than 1 or 0.
The code I have so far is:
set.seed(123)
firm <- rnorm(36, 0, 0.05)
peers <- matrix(rnorm(36 * 50, 0, 0.05), 36, 50)
#Function to maximize
cor.model <- function(w = rep(1 / ncol(peers), ncol(peers))){
f_score <- peers %*% (w / sum(w))
x <- f_score
y <- firm
correl <- cor(x,y)
return(correl)
}
#Output
out <- optim(par = rep(1 / ncol(peers), ncol(peers)),
fn = cor.model,
method = "L-BFGS-B",
lower = rep(0, ncol(peers)), # W_i >= 0 for all i
upper = rep(1, ncol(peers)), # W_i <= 1 for all i)
control = list(fnscale = -1))
out$par/sum(out$par)
cor(firm, rowSums(peers))
cor(firm, rowSums(t(as.vector(out$par/sum(out$par))*t(peers))))
Thanks a lot!
Upvotes: 2
Views: 444
Reputation: 1493
I would try a Local Search (as described in this tutorial).
Here is a sketch in R.
FIRM <- as.matrix(df[[1]])
M <- as.matrix(df[, -1])
library("neighbours") ## https://github.com/enricoschumann/neighbours
library("NMOF") ## https://github.com/enricoschumann/NMOF
N <- neighbourfun(type = "logical", kmin = 1, kmax = 50)
An initial solution.
x <- logical(50)
x[1:5] <- TRUE
The objective function. The function we use later minimises, so I put a minus in front of the computation.
of_cor <- function(x, FIRM, M) {
-c(cor(FIRM, rowMeans(M[, x])))
}
Test: select all 50 columns.
-of_cor(!logical(50), FIRM, M)
## [1] -0.1727944
Test: use the initial solution.
-of_cor(x, FIRM, M)
## [1] -0.2261783
Run the actual computation, with Threshold Accepting (which is based on Local Search).
sol <- TAopt(of_cor,
list(x0 = x,
neighbour = N,
nI = 50000),
M = M,
FIRM = FIRM)
## Threshold Accepting
## [....]
## Finished.
## Best solution overall: -0.6206239
The solution has a correlation of 0.62.
-of_cor(sol$xbest, FIRM, M)
## [1] 0.6206239
(Disclosure: I am the maintainer of the packages that I used.)
Upvotes: 3