Reputation: 28159
I'm trying to use linear regression to find the best weighting of 3 models for predicting an outcome. There are 3 variables (x1, x2, x3) that are predictions of the dependent variable, y. My question is: how do I run a regression with the constraint that the coefficients sum to 1? For example:
this is good:
y = .2(x1) + .4(x2) + .4(x3)
since .2 + .4 + .4 = 1
this is no good:
y = 1.2(x1) + .4(x2) + .3(x3)
since 1.2 + .4 + .3 > 1
I'm looking to do this in R if possible. Thanks. Let me know if this needs to get moved to the stats area ('Cross-Validated').
EDIT:
The problem is to classify each row as 1 or 0. y is the actual values ( 0 or 1 ) from the training set, x1 is the predicted values from a kNN model, x2 is from a randomForest, x3 is from a gbm model. I'm trying to get the best weightings for each model, so each coefficient is <=1 and the sum of the coefficients == 1. Would look something like this:
y (actual)  knnPred  RfPred  gbmPred
0           .1111    .0546   .03325
1           .7778    .6245   .60985
0           .3354    .1293   .33255
0           .2235    .9987   .10393
1           .9888    .6753   .88933
...         ...      ...     ...
The measure for success is AUC. So I'm trying to set the coefficients to maximize AUC while making sure they sum to 1.
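To make the objective concrete, here is a minimal base-R sketch of how the AUC of a weighted blend could be computed for a given weight vector. The helper name blend_auc is hypothetical (not from any package), and the rank-based (Mann-Whitney) formula is one common way to compute AUC, assumed here rather than taken from a specific library. The data are the five rows shown above.

```r
# Hypothetical helper: AUC of the blend X %*% w via the rank-based
# (Mann-Whitney) formula. Tied scores receive average ranks.
blend_auc <- function(w, X, y) {
  stopifnot(abs(sum(w) - 1) < 1e-8)        # weights must sum to 1
  score <- as.vector(X %*% w)              # blended prediction per row
  n1 <- sum(y == 1)                        # number of positives
  n0 <- sum(y == 0)                        # number of negatives
  (sum(rank(score)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# The five example rows from the table above.
y <- c(0, 1, 0, 0, 1)
X <- cbind(knnPred = c(.1111, .7778, .3354, .2235, .9888),
           RfPred  = c(.0546, .6245, .1293, .9987, .6753),
           gbmPred = c(.03325, .60985, .33255, .10393, .88933))

blend_auc(c(1, 1, 1) / 3, X, y)  # equal weights
blend_auc(c(0, 1, 0), X, y)      # randomForest only
```

On these five rows the equal-weight blend separates the classes completely (AUC of 1), while RfPred alone gets 4 of the 6 positive/negative pairs in the right order (AUC of 2/3). Because AUC depends only on the ordering of the scores, it changes in steps as the weights move, which is worth keeping in mind when choosing an optimizer.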
Upvotes: 3
Views: 5302
Reputation: 269644
For the five rows shown, either round(knnPred) or round(gbmPred) gives perfect predictions, so it is questionable whether more than one predictor is needed.
At any rate, to solve the question as stated, the following gives nonnegative coefficients that sum to 1 (except possibly for tiny differences due to floating-point arithmetic). a is the matrix of independent variables and b is the dependent variable. c and d define the equality constraint (coefficients sum to 1), and e and f define the inequality constraints (coefficients are nonnegative).
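library(lsei)
a <- cbind(x1, x2, x3)       # predictor matrix (one column per model)
b <- y                       # observed outcome
c <- matrix(c(1, 1, 1), 1)   # equality constraint: c %*% coef == d
d <- 1
e <- diag(3)                 # inequality constraint: e %*% coef >= f
f <- c(0, 0, 0)
lsei(a, b, c, d, e, f)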
Upvotes: 0
Reputation: 263352
Untested, since there is no data to test on:
mod1 <- lm(y ~ 0+x1+x2+x3, data=dat)
mod2 <- lm(y/I(sum(coef(mod1))) ~ 0+x1+x2+x3, data=dat)
And now that I think about it some more, skip mod2, just:
coef(mod1)/sum(coef(mod1))
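Rescaling the unconstrained coefficients makes them sum to 1, but the result is not in general the least-squares fit under that constraint. As a hedged alternative sketch (a different technique from the rescaling above), the constraint can be substituted into the model so that lm fits it directly: with b3 = 1 - b1 - b2, the model becomes y - x3 = b1 (x1 - x3) + b2 (x2 - x3). The data frame dat below is simulated purely for illustration, with assumed true weights 0.2, 0.5 and 0.3.

```r
# Simulated data (an assumption for illustration; not from the question).
set.seed(42)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
dat$y <- 0.2 * dat$x1 + 0.5 * dat$x2 + 0.3 * dat$x3 + rnorm(n, sd = 0.01)

# Substitute b3 = 1 - b1 - b2 and fit the two free coefficients.
fit <- lm(I(y - x3) ~ 0 + I(x1 - x3) + I(x2 - x3), data = dat)

# Recover all three weights; they sum to 1 by construction.
b <- unname(coef(fit))
w <- c(x1 = b[1], x2 = b[2], x3 = 1 - sum(b))
w
```

This enforces only the sum-to-1 constraint; it does not stop a weight from going negative or above 1.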
Upvotes: 2
Reputation: 145775
There's very likely a better way that someone else will share, but you're looking for two parameters, b1 and b2, such that
b1 * x1 + b2 * x2 + (1 - b1 - b2) * x3
is close to y. To do that, I'd write an error function to minimize:
minimizeMe <- function(b, x, y) {  ## Calculates MSE
  mean((b[1] * x[, 1] + b[2] * x[, 2] + (1 - sum(b)) * x[, 3] - y) ^ 2)
}
and throw it to optim
fit <- optim(par = c(.2, .4), fn = minimizeMe, x = cbind(x1, x2, x3), y = y)
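For a quick end-to-end check, here is a sketch that runs the same function on simulated data and recovers all three weights from fit$par. The data and the true weights (0.2, 0.5, 0.3) are assumptions made up for illustration, and w_hat is just an illustrative name.

```r
# MSE of the blend; the third weight is implied by the sum-to-1 constraint.
minimizeMe <- function(b, x, y) {
  mean((b[1] * x[, 1] + b[2] * x[, 2] + (1 - sum(b)) * x[, 3] - y) ^ 2)
}

# Simulated data with known weights (hypothetical, for illustration only).
set.seed(1)
n <- 200
x <- cbind(x1 = runif(n), x2 = runif(n), x3 = runif(n))
y <- as.vector(x %*% c(0.2, 0.5, 0.3)) + rnorm(n, sd = 0.01)

# optim defaults to Nelder-Mead; x and y are passed through to minimizeMe.
fit <- optim(par = c(.2, .4), fn = minimizeMe, x = x, y = y)

# fit$par holds b1 and b2; the third weight is whatever is left over.
w_hat <- c(fit$par, 1 - sum(fit$par))
w_hat
fit$convergence  # 0 indicates optim reported successful convergence
```

Note that this parametrization guarantees the weights sum to 1 but does not keep them between 0 and 1.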
Upvotes: 6