galman
galman

Reputation: 38

Data frame subset with specified sum of elements

Having a data frame like this:

   df <- data.frame(a=c(31, 18, 0, 1, 20, 2), 
   b=c(1, 0, 0, 3, 1, 1), 
   c=c(12, 0, 9, 8, 10, 3))

   > df
      a b  c
   1 31 1 12
   2 18 0  0
   3  0 0  9
   4  1 3  8
   5 20 1 10
   6  2 1  3

How can I do a random subset so the sum of rows and columns is equal to a value, i.e , 100?

Upvotes: 0

Views: 160

Answers (1)

josliber
josliber

Reputation: 44309

As I understand your question, you're trying to sample a subset of the rows and columns of your matrix so that they sum to a target value.

You can use integer optimization to accomplish this. You'll have a binary decision variable for each row, column, and cell, and constraints to force the cell values to be equal to the product of the row and column values. I'll use the lpSolve package to do this, because it has a convenient mechanism to get multiple optimal solutions. We can then use the sample function to select between them:

library(lpSolve)
get.subset <- function(dat, target) {
  nr <- nrow(dat)
  nc <- ncol(dat)
  nvar <- nr + nc + nr*nc
  # Cells upper bounded by row and column variable values (r and c) and lower bounded by r+c-1
  mat <- as.matrix(do.call(rbind, apply(expand.grid(seq(nr), seq(nc)), 1, function(x) {
    r <- x[1]
    c <- x[2]
    pos <- nr + nc + (r-1)*nc + c
    ltc <- rep(0, nvar)
    ltc[nr + c] <- 1
    ltc[pos] <- -1
    ltr <- rep(0, nvar)
    ltr[r] <- 1
    ltr[pos] <- -1
    gtrc <- rep(0, nvar)
    gtrc[nr + c] <- 1
    gtrc[r] <- 1
    gtrc[pos] <- -1
    return(as.data.frame(rbind(ltc, ltr, gtrc)))
  })))
  dir <- rep(c(">=", ">=", "<="), nr*nc)
  rhs <- rep(c(0, 0, 1), nr*nc)

  # Sum of selected cells equals target
  mat <- rbind(mat, c(rep(0, nr+nc), as.vector(t(dat))))
  dir <- c(dir, "=")
  rhs <- c(rhs, target)

  res <- lp(objective.in=rep(0, nvar),  # Feasibility problem
            const.mat=mat,
            const.dir=dir,
            const.rhs=rhs,
            all.bin=TRUE,
            num.bin.solns=100  # Number of feasible solutions to get
            )
  if (res$status != 0) {
    return(list(rows=NA, cols=NA, subset=NA, num.sol=0))
  }
  sol.num <- sample(res$num.bin.solns, 1)
  vals <- res$solution[seq((sol.num-1)*nvar+1, sol.num*nvar)]
  rows <- which(vals[seq(nr)] >= 0.999)
  cols <- which(vals[seq(nr+1, nr+nc)] >= 0.999)
  return(list(rows=rows, cols=cols, subset=dat[rows,cols], num.sol=res$num.bin.solns))
}

The function returns the number of subset with that sum and returns the randomly selected subset:

set.seed(144)
get.subset(df, 1)
# $rows
# [1] 1
# $cols
# [1] 2
# $subset
# [1] 1
# $num.sol
# [1] 14

get.subset(df, 100)
# $rows
# [1] 1 2 4 5
# $cols
# [1] 1 3
# $subset
#    a  c
# 1 31 12
# 2 18  0
# 4  1  8
# 5 20 10
# $num.sol
# [1] 2

get.subset(df, 10000)
# $rows
# [1] NA
# $cols
# [1] NA
# $subset
# [1] NA
# $num.sol
# [1] 0

Upvotes: 2

Related Questions