Reputation: 542
This is part of a large matrix (dimension around: 1'000-1'000'000 rows x 100 - 1'000 columns):
scen_1 scen_2 scen_3 scen_4 ...
...
9 3.262275 0.000000 0.00000 0.0000000 ...
10 2.843631 0.000000 1.22636 1.0559217 ...
11 0.000000 0.000000 0.00000 0.9836209 ...
12 2.572686 0.000000 0.00000 1.1000293 ...
13 0.000000 0.000000 0.00000 0.0000000 ...
14 0.611070 1.478159 0.00000 0.0000000 ...
15 0.000000 0.000000 0.00000 0.0000000 ...
16 0.000000 0.000000 0.00000 1.0146529 ...
...
Now, I want to select n rows, which - after getting for each column the maximum - have the highest sum, thus row that complement each other well. E.g. I select row 9 and 10 I get the combined (max values) vector 3.262275 0.00000 1.22636 1.0559217
with a sum of 5.5445567
. Whereas if I select 14 and 16 I get 0.611070 1.478159 0.00000 1.0146529
with a sum of 3.1038819
, thus the first option is better.
The solution for the above example for an n of 3 would be rows, 10, 14 and 9. I hope I could explain to problem well.
My approach would be to first select the row with the highest row wise sum, then literally the rows that would add the highest additional value. But I have the strong feeling that this not always gives the best solution. Calculating all possibilities combinations is not viable due to the size of the matrix. Is a genetic algorithm a solution? Or is there a simpler approach? Thanks.
Edit:
For easier understanding, here is an MWE:
# Create example matrix
mat <- matrix(c(1.562275, 0.000000, 0.00000, 0.0000000,2.843631, 0.000000, 1.22636, 1.0559217,0.000000, 0.000000, 0.00000, 0.9836209,1.572686, 0.000000, 0.00000, 1.8000293,0.000000, 0.000000, 0.00000, 0.0000000,1.611070, 1.478159, 0.00000, 0.0000000,0.000000, 0.000000, 0.00000, 0.0000000,0.000000, 0.000000, 0.00000, 1.0146529), byrow = TRUE, ncol = 4, dimnames = list(c(9:16), c("scen_1", "scen_2", "scen_3", "scen_4")))
# Function to evaluate each combination of rows (this value should be maximized)
get_combined_max_value_sum <- function(choosen_rows){
# Select rows
sel_mat <- mat[choosen_rows,]
# calculate columwise max
max_mat <- apply(sel_mat, 2, max)
# Sum the values
return(sum(max_mat))
}
# I am looking for the function best_rows() which returns the rows, which gives the
# maximum value (or at least a close guess) for the get_combined_max_value_sum()
# function
best_rows <- function(n_rows){
result <- vector()
# do some magic
return(result) # vector with length n_row for the "best" rows.
}
# ------------------------------------------------
# @ slamballais
# The rows with the highest rowise sum (10 & 12)
get_combined_max_value_sum(c("10","12"))
# get a lower score then row 9 and 13
get_combined_max_value_sum(c("10","14"))
Upvotes: 3
Views: 403
Reputation: 101014
You can define a recursive function f
(see it within function thomas2
), which can be any number of rows k
(1 <= k <= nrow(mat)
)
thomas2 <- function(mat, k) {
f <- function(mat, k) {
if (k == 1) {
return(which.max(rowSums(mat)))
}
p <- f(mat, k - 1)
q <- seq(nrow(mat))[-p]
rmax <- apply(mat[p, , drop = FALSE], 2, max)
c(p, q[which.max(sapply(q, function(k) sum(pmax(rmax, mat[k, ]))))])
}
row.names(mat)[sort(f(mat, k))]
}
For example
> thomas2(mat, 2)
[1] "10" "14"
> thomas2(mat, 3)
[1] "10" "12" "14"
> thomas2(mat, 4)
[1] "9" "10" "12" "14"
> thomas2(mat, 5)
[1] "9" "10" "11" "12" "14"
> thomas2(mat, 6)
[1] "9" "10" "11" "12" "13" "14"
Your algorithm is a greedy one, which cannot guarantee the global maximum always. Thus, a brute-force way might a straightforward workaround to reach your goal.
Maybe you can try the following brute-force method
rs <- combn(nrow(mat), 3)
row.names(mat)[rs[, which.max(apply(rs, 2, function(k) sum(do.call(pmax, data.frame(t(mat[k, ]))))))]]
which gives
[1] "10" "12" "14"
Upvotes: 3
Reputation: 3235
This is not the optimal answer, but it may inspire others...
Assumptions
k
rows, where k
is prespecified by the user.k
<= ncol(mat)
Answer
Certain rows will never be part of the answer. I propose to filter out those rows before applying a brute force approach. Filter conditions so far:
Code
slam <- function(mat, k) {
cm <- apply(mat, 2, max)
rs <- apply(mat, 1, function(x) sum(x[x > 0], na.rm = TRUE))
# remove rows whose sum is lower than the lowest column max
matb <- subset(mat, rs > min(cm))
# remove rows that have only values lower than all values of the rows containing a column max
mrows <- matb[apply(matb, 2, which.max), ]
any_bigger <- apply(mrows, 1, function(x) rowSums(sweep(matb, 2, x, `-`) >= 0) > 0)
matc <- matb[apply(any_bigger, 1, all), ]
# code copied + modified from @ThomasIsCoding's answer
rs <- combn(nrow(matc), k)
row.names(matc)[rs[, which.max(apply(rs, 2, function(z) sum(do.call(pmax, data.frame(t(matc[z, ]))))))]]
}
Example + Benchmark
# bigger dataset with 100 rows and negative values too
n <- 100
n2 <- 500
set.seed(2021)
mat2 <- matrix(rnorm(n * 4), ncol = 4, dimnames = list(c(1:n), c("scen_1", "scen_2", "scen_3", "scen_4")))
mat3 <- matrix(rnorm(n2 * 4), ncol = 4, dimnames = list(c(1:n2), c("scen_1", "scen_2", "scen_3", "scen_4")))
# verification
slam(mat, 3) # [1] "10" "12" "14"
thomas(mat) # [1] "10" "12" "14"
slam(mat2, 3) # [1] "25" "44" "99"
thomas(mat2) # [1] "25" "44" "99"
# benchmark (without `thomas(mat3)`, it takes too long)
microbenchmark::microbenchmark(slam(mat2, 3), thomas(mat2),
slam(mat3, 3), times = 1L)
# Unit: milliseconds
# expr min lq mean median uq max neval
# slam(mat2, 3) 249.4705 249.4705 249.4705 249.4705 249.4705 249.4705 1
# thomas(mat2) 19557.8194 19557.8194 19557.8194 19557.8194 19557.8194 19557.8194 1
# slam(mat3, 3) 16159.9113 16159.9113 16159.9113 16159.9113 16159.9113 16159.9113 1
Final thoughts
There is another way to do this. Start out with an initial combination of the k
rows that contain the k
largest maximum column values. For each of those rows, calculate whether there are other rows that provide a further gain in the remaining column(s). If there is a better row, try swapping it out with the initial combination. Keep repeating this process until the best rows have been selected. I don't have time to write it right now, but if it hasn't been done by tomorrow then I'll give it a shot.
Upvotes: 3