JoshDG
JoshDG

Reputation: 3931

Sample with constraint, vectorized

What is the most efficient way to sample a data frame under a certain constraint?

For example, say I have a directory of Names and Salaries, how do I select 3 such that their sum does not exceed some value. I'm just using a while loop but that seems pretty inefficient.

Upvotes: 1

Views: 628

Answers (3)

Eduardo Leoni
Eduardo Leoni

Reputation: 9050

How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.

## create data frame
N <- 100
salary <- rnorm(N))
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum
sx <- colSums(x)
sxc <- sx[sx<1]
## sampling with replacement
sample(sxc, 10, replace=TRUE) 

Upvotes: 1

IRTFM
IRTFM

Reputation: 263342

You could face a combinatorial explosion. This simulates the selection of 3 combinations of the EE's from a set of 20 with salaries at a mean of 60 and sd 20. It shows that from the enumeration of the 1140 combinations you will find only 263 having sum of salaries less than 150.

> sum( apply( combn(1:20,3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 200
> set.seed(123)

> salry <- data.frame(EEnams = sapply(1:20 , 
                                       function(x){paste(sample(letters[1:20], 6) , 
                                             collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
  EEnams     sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057

> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263

If you had 1000 EE's then you would have:

> choose(1000, 3)   # Combination possibilities
# [1] 166,167,000   Commas added to output

Upvotes: 1

Brian Diggs
Brian Diggs

Reputation: 58825

One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).

Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.

Upvotes: 1

Related Questions