WitheShadow
WitheShadow

Reputation: 542

Select n row have highest combined value from a matrix in R

This is part of a large matrix (dimension around: 1'000-1'000'000 rows x 100 - 1'000 columns):

     scen_1   scen_2  scen_3    scen_4 ...
...
9  3.262275 0.000000 0.00000 0.0000000 ...
10 2.843631 0.000000 1.22636 1.0559217 ...
11 0.000000 0.000000 0.00000 0.9836209 ...
12 2.572686 0.000000 0.00000 1.1000293 ...
13 0.000000 0.000000 0.00000 0.0000000 ...
14 0.611070 1.478159 0.00000 0.0000000 ...
15 0.000000 0.000000 0.00000 0.0000000 ...
16 0.000000 0.000000 0.00000 1.0146529 ...
...

Now, I want to select n rows, which - after getting for each column the maximum - have the highest sum, thus row that complement each other well. E.g. I select row 9 and 10 I get the combined (max values) vector 3.262275 0.00000 1.22636 1.0559217 with a sum of 5.5445567. Whereas if I select 14 and 16 I get 0.611070 1.478159 0.00000 1.0146529 with a sum of 3.1038819, thus the first option is better.

The solution for the above example for an n of 3 would be rows, 10, 14 and 9. I hope I could explain to problem well.

My approach would be to first select the row with the highest row wise sum, then literally the rows that would add the highest additional value. But I have the strong feeling that this not always gives the best solution. Calculating all possibilities combinations is not viable due to the size of the matrix. Is a genetic algorithm a solution? Or is there a simpler approach? Thanks.

Edit:

For easier understanding, here is an MWE:

# Create example matrix
mat <- matrix(c(1.562275, 0.000000, 0.00000, 0.0000000,2.843631, 0.000000, 1.22636, 1.0559217,0.000000, 0.000000, 0.00000, 0.9836209,1.572686, 0.000000, 0.00000, 1.8000293,0.000000, 0.000000, 0.00000, 0.0000000,1.611070, 1.478159, 0.00000, 0.0000000,0.000000, 0.000000, 0.00000, 0.0000000,0.000000, 0.000000, 0.00000, 1.0146529), byrow = TRUE,  ncol = 4, dimnames = list(c(9:16), c("scen_1",  "scen_2",  "scen_3", "scen_4")))

# Function to evaluate each combination of rows (this value should be maximized)
get_combined_max_value_sum <- function(choosen_rows){
  # Select rows
  sel_mat <- mat[choosen_rows,]
  
  # calculate columwise max
  max_mat <- apply(sel_mat, 2, max)
  
  # Sum the values
  return(sum(max_mat))
}

# I am looking for the function best_rows() which returns the rows, which gives the 
# maximum value (or at least a close guess) for the get_combined_max_value_sum() 
# function
best_rows <- function(n_rows){
  result <- vector()
  
  # do some magic
  
  return(result) # vector with length n_row for the "best" rows.
}

# ------------------------------------------------
# @ slamballais
# The rows with the highest rowise sum (10 & 12)
get_combined_max_value_sum(c("10","12"))

# get a lower score then row 9 and 13
get_combined_max_value_sum(c("10","14"))

Upvotes: 3

Views: 403

Answers (2)

ThomasIsCoding
ThomasIsCoding

Reputation: 101014

Update (Recursive approach, sub-optimal solution)

You can define a recursive function f (see it within function thomas2), which can be any number of rows k (1 <= k <= nrow(mat))

thomas2 <- function(mat, k) {
  f <- function(mat, k) {
    if (k == 1) {
      return(which.max(rowSums(mat)))
    }
    p <- f(mat, k - 1)
    q <- seq(nrow(mat))[-p]
    rmax <- apply(mat[p, , drop = FALSE], 2, max)
    c(p, q[which.max(sapply(q, function(k) sum(pmax(rmax, mat[k, ]))))])
  }
  row.names(mat)[sort(f(mat, k))]
}

For example

> thomas2(mat, 2)
[1] "10" "14"

> thomas2(mat, 3)
[1] "10" "12" "14"

> thomas2(mat, 4)
[1] "9"  "10" "12" "14"

> thomas2(mat, 5)
[1] "9"  "10" "11" "12" "14"

> thomas2(mat, 6)
[1] "9"  "10" "11" "12" "13" "14"

Previous answer (Brute-force approach, inefficient)

Your algorithm is a greedy one, which cannot guarantee the global maximum always. Thus, a brute-force way might a straightforward workaround to reach your goal.

Maybe you can try the following brute-force method

rs <- combn(nrow(mat), 3)
row.names(mat)[rs[, which.max(apply(rs, 2, function(k) sum(do.call(pmax, data.frame(t(mat[k, ]))))))]]

which gives

[1] "10" "12" "14"

Upvotes: 3

slamballais
slamballais

Reputation: 3235

This is not the optimal answer, but it may inspire others...

Assumptions

  • The answer has k rows, where k is prespecified by the user.
  • k <= ncol(mat)

Answer

Certain rows will never be part of the answer. I propose to filter out those rows before applying a brute force approach. Filter conditions so far:

  • Remove rows whose sum is lower than the lowest maximum column value
  • Remove rows where all values are lower than any row that contains a maximum column value

Code

slam <- function(mat, k) {
  cm <- apply(mat, 2, max)
  rs <- apply(mat, 1, function(x) sum(x[x > 0], na.rm = TRUE))
  
  # remove rows whose sum is lower than the lowest column max
  matb <- subset(mat, rs > min(cm))
  
  # remove rows that have only values lower than all values of the rows containing a column max
  mrows <- matb[apply(matb, 2, which.max), ]
  any_bigger <- apply(mrows, 1, function(x) rowSums(sweep(matb, 2, x, `-`) >= 0) > 0)
  matc <- matb[apply(any_bigger, 1, all), ]
  
  # code copied + modified from @ThomasIsCoding's answer
  rs <- combn(nrow(matc), k)
  row.names(matc)[rs[, which.max(apply(rs, 2, function(z) sum(do.call(pmax, data.frame(t(matc[z, ]))))))]]
}

Example + Benchmark


# bigger dataset with 100 rows and negative values too
n <- 100
n2 <- 500
set.seed(2021)
mat2 <- matrix(rnorm(n * 4), ncol = 4, dimnames = list(c(1:n), c("scen_1",  "scen_2",  "scen_3", "scen_4")))
mat3 <- matrix(rnorm(n2 * 4), ncol = 4, dimnames = list(c(1:n2), c("scen_1",  "scen_2",  "scen_3", "scen_4")))

# verification
slam(mat, 3)     # [1] "10" "12" "14"
thomas(mat)      # [1] "10" "12" "14"
slam(mat2, 3)    # [1] "25" "44" "99"
thomas(mat2)     # [1] "25" "44" "99"

# benchmark (without `thomas(mat3)`, it takes too long)
microbenchmark::microbenchmark(slam(mat2, 3), thomas(mat2),
                               slam(mat3, 3), times = 1L)

# Unit: milliseconds
#          expr        min         lq       mean     median         uq        max neval
# slam(mat2, 3)   249.4705   249.4705   249.4705   249.4705   249.4705   249.4705     1
#  thomas(mat2) 19557.8194 19557.8194 19557.8194 19557.8194 19557.8194 19557.8194     1
# slam(mat3, 3) 16159.9113 16159.9113 16159.9113 16159.9113 16159.9113 16159.9113     1

Final thoughts

There is another way to do this. Start out with an initial combination of the k rows that contain the k largest maximum column values. For each of those rows, calculate whether there are other rows that provide a further gain in the remaining column(s). If there is a better row, try swapping it out with the initial combination. Keep repeating this process until the best rows have been selected. I don't have time to write it right now, but if it hasn't been done by tomorrow then I'll give it a shot.

Upvotes: 3

Related Questions