crockpotveggies
crockpotveggies

Reputation: 13320

Scala: how can I generate numbers according to an expected distribution?

Sometimes Scala's random doesn't help when wanting to generate random numbers of smaller ranges or when you already know that certain numbers already have an associated probability attached to them. Obviously, scala.util.Random.nextInt won't get the whole job done.

How do I select numbers according to their weights?

Upvotes: 8

Views: 8179

Answers (2)

dhg
dhg

Reputation: 52691

Here's a simple explanation. Imagine you have multinomial over the values A,B,C where they have the following probabilities:

  • A = 0.5
  • B = 0.3
  • C = 0.2

If you want to sample a value according to its probability, then that means that roughly 50% of the time you want to get A, 30% of the time B, and 20% of the time C.

So imagine your distribution as a broken stick:

       A         B      C
      0.5       0.3    0.2
|------------|-------|-----|
0           0.5     0.8   1.0

The procedure for sampling from the multinomial starts with a random value p sampled uniformly between 0 and 1. Then you check which portion of the stick p falls in, and return the corresponding value.

So if p = 0.7, then:

       A         B      C
      0.5       0.3    0.2
|------------|-------|-----|
0           0.5     0.8   1.0
                  ^
                 p=0.7

and you would return B.

Code-wise, then would look like:

final def sample[A](dist: Map[A, Double]): A = {
  val p = scala.util.Random.nextDouble
  val it = dist.iterator
  var accum = 0.0
  while (it.hasNext) {
    val (item, itemProb) = it.next
    accum += itemProb
    if (accum >= p)
      return item  // return so that we don't have to search through the whole distribution
  }
  sys.error(f"this should never happen")  // needed so it will compile
}

And you can check it like this:

val dist = Map('A -> 0.5, 'B -> 0.3, 'C -> 0.2)
sample(dist)  // 'A

Vector.fill(1000)(sample(dist)).groupBy(identity).mapValues(_.size) // Map('A -> 510, 'B -> 300, 'C -> 190)

Other things:

  • If dist is not a probability distribution (ie, weights don't sum to one), then you would simply use p = nextDouble * dist.values.sum. So if dist summed to 0.5, then p would be uniform between 0.0 and 0.5; if it summed to 20, then p would be uniform between 0.0 and 20.0.

There are other optimizations you can do, like sorting the entries with the largest probabilities first so that you minimize the number of entries you have to look through before you accumulate p probability mass, but this should get you through the basic idea.

Upvotes: 25

crockpotveggies
crockpotveggies

Reputation: 13320

I was able to adapt a script from this Gist https://gist.github.com/anonymous/2033568 to achieve a single weighted random selection from a list of weighted numbers. Here's an example how to implement it:

val range = (min to max)

// use probabilities from the existing dataset to generate numbers
val weightedItems = range.map { number =>
  val w = Probability.pNumber(datasetId, number, i)
  WeightedItem[Int](number, w)
}
val selection = WeightedRandomSelection.singleWeightedSelection(weightedItems)

And here's the Gist slightly adapted:

object WeightedRandomSelection {

  /**
   * Get the number of times an event with probability p occurs in N samples.
   * if R is res, then P(R=n) = p^n q^(N-n) N! / n! / (N-n)!
   * where q = 1-p
   * This has the property that P(R=0) = q^N, and
   * P(R=n+1) = p/q (N-n)/(n+1) P(R=n)
   * Also note that P(R=n+1|R>n) = P(R=n+1)/P(R>n)
   * Uses these facts to work out the probability that the result is zero. If
   * not, then the prob that given that, the result is 1, etc.
   */
  def numEntries(p:Double,N:Int,r:Random) : Int = if (p>0.5) N-numEntries(1.0-p,N,r) else if (p<0.0) 0 else {
    var n = 0
    val q = 1.0-p
    var prstop = Math.pow(q,N)
    var cumulative = 0.0
    while (n<N && (r.nextDouble()*(1-cumulative))>=prstop) {
      cumulative+=prstop
      prstop*=p*(N-n)/(q*(n+1))
      n+=1
    }
    n
  }


  case class WeightedItem[T](item: T, weight: Double)

  /**
   * Compute a weighted selection from the given items.
   * cumulativeSum must be the same length as items (or longer), with the ith element containing the sum of all
   * weights from the item i to the end of the list. This is done in a saved way rather than adding up and then
   * subtracting in order to prevent rounding errors from causing a variety of subtle problems.
   */
  private def weightedSelectionWithCumSum[T](items: Seq[WeightedItem[T]],cumulativeSum:List[Double], numSelections:Int, r: Random) : Seq[T] = {
    if (numSelections==0) Nil
    else {
      val head = items.head
      val nhead = numEntries(head.weight/cumulativeSum.head,numSelections,r)
      List.fill(nhead)(head.item)++weightedSelectionWithCumSum(items.tail,cumulativeSum.tail,numSelections-nhead,r)
    }
  }


  def weightedSelection[T](items: Seq[WeightedItem[T]], numSelections:Int, r: Random): Seq[T] = {
    val cumsum = items.foldRight(List(0.0)){(wi,l)=>(wi.weight+l.head)::l}
    weightedSelectionWithCumSum(items,cumsum,numSelections,r)
  }

  def singleWeightedSelection[T](items: Seq[WeightedItem[T]]): T = {
    val r = new scala.util.Random()
    val numSelections = 1
    val cumsum = items.foldRight(List(0.0)){(wi,l)=>(wi.weight+l.head)::l}
    weightedSelectionWithCumSum(items,cumsum,numSelections,r).head
  }


  def testRandomness[T](items: Seq[WeightedItem[T]], numSelections:Int, r: Random) {
    val runs = 10000
    val indexOfItem = Map.empty++items.zipWithIndex.map{case (item,ind)=>item.item->ind}
    val numItems = items.length
    val bucketSums = new Array[Double](numItems)
    val bucketSumSqs = new Array[Double](numItems)
    for (run<-0 until runs) {
      // compute chi-squared for a run
      val runresult = weightedSelection(items,numSelections,r)
      val buckets = new Array[Double](numItems)
      for (r<-runresult) buckets(indexOfItem(r))+=1
      for (i<-0 until numItems) {
        val count = buckets(i)
        bucketSums(i)+=count
        bucketSumSqs(i)+=count*count
      }
    }
    val sumWeights = items.foldLeft(0.0)(_+_.weight)
    for ((item,ind)<-items.zipWithIndex) {
      val p = item.weight/sumWeights
      val mean = bucketSums(ind)/runs
      val variance = bucketSumSqs(ind)/runs-mean*mean
      val expectedMean = numSelections*p
      val expectedVariance = numSelections*p*(1-p)
      val expectedErrorInMean = Math.sqrt(expectedVariance/runs)
      val text = "Item %10s Mean %.3f Expected %.3f±%.3f Variance %.3f expected %.3f".format(item.item,mean,expectedMean,expectedErrorInMean,variance,expectedVariance)
      println(text)
    }
  }

}

Upvotes: 0

Related Questions