Ben

Reputation: 569

Fastest way to randomly shuffle an array based on probabilities in PHP?

I have an array with values and probabilities (or in PHP, keys and values):

Value   Probability
John    3
Peter   2
Paul    1

I want to shuffle this array but have the order be influenced by the probability. That is, on average John should appear at the top of the list half the time, Peter a third of the time, and Paul one sixth of the time, etc.* The average order over thousands of shuffles should be the one given above.

I have thought to

  1. Create an array in which each value appears as often as indicated by the probability:

    John
    John
    John
    Peter
    Peter
    Paul
    
  2. Shuffle this array, e.g.:

    John
    Paul
    Peter
    John
    John
    Peter
    
  3. Remove the duplicates, leaving the first instance:

    John
    Paul
    Peter
    

In PHP code this might look something like this:

$items = array(
    'John' => 3,
    'Peter' => 2,
    'Paul' => 1
);

$probabilities = array();
foreach($items as $item => $value){
    $interim = array_fill(1, $value, $item);
    $probabilities = array_merge($probabilities, $interim);
}

shuffle($probabilities);

$result = array_unique($probabilities);

print("<pre>".print_r($result,true)."</pre>");

Is there a faster way to do this?

The real array contains a few hundred values and this probability-based shuffling has to be done every time a webpage is called, so it should be lightning quick.


* Here are the probabilities for each position for each person:

Person P1  P2  P3
John   3/6 2/6 1/6
Peter  2/6 2/6 2/6
Paul   1/6 2/6 3/6

Note that the probabilities for the second and third rank follow from the probability for the first rank and are not explicitly given in the original data. They are just presented here for your convenience, because someone asked about them in the comments below.


The real array contains a few hundred values:

John 3  ← line 1
Peter 2
Paul 1
Simon 6
Robert 2
William 1
...
Sean 1  ← line 786

Upvotes: 0

Views: 115

Answers (3)

LSerni

Reputation: 57453

Sure, your original approach works. You do not need, however, to create large arrays.

The urn behaviour can be implemented like this (note: for smaller arrays I suspect your approach might be faster, and the fact it's not memory efficient could well be irrelevant. I haven't checked where the tipping point is):

    function extractWithProbability(array $order): string {
        if (1 === count($order)) {
            return array_key_first($order);
        }
        $sum    = array_sum($order);
        $throw  = rand(0, $sum - 1);
        foreach ($order as $key => $value) {
            if ($throw < $value) {
                return $key;
            }
            $throw -= $value;
        }
        throw new RuntimeException('Unexpected');
    }

First we get the total number of balls in our virtual urn. Each player contributes balls according to his weight, so John gets three, Peter gets two, Paul gets one. Then we imagine the balls lined up one after another and draw a random position, from 0 to the number of balls minus one, at which to pick a ball. This takes care of probability.

Then we build the array.

We could do this with uasort and a custom sort function, but it would involve knowing that PHP is internally using quicksort() and a whole lot of other assumptions. We could get a very approximate probability function for the compare callback, but I don't think it would be worth the pain, and it could give occasional pathological results.

So, we simply "remove" all of a player's balls whenever he gets chosen, and repeat the extraction with the remaining players:

    function shuffleWithProbability(array $order): array {
        $result = [ ];
        while (!empty($order)) {
            $chosen = extractWithProbability($order);
            $result[] = $chosen;
            unset($order[$chosen]);
        }
        return $result;
    }

On some hardware, and with small arrays, it might be more convenient to neuter $order[$chosen] by setting its value to 0. The array is then never internally reshuffled, which is a gain, but it is scanned in its entirety on each loop iteration, which is a loss (N^2 against 1/2 N^2 on average). For large N's this is not convenient. We could try to offset this effect by running $order through an array_filter every K iterations (instead of unset we would set the value to 0, and when, say, count($order) modulo X is Y, run it through array_filter; X and Y will vary based on hardware, kernel configuration and PHP version - a maintenance nightmare).
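A minimal sketch of that zero-out variant (the function name shuffleWithZeroing, the running $sum, and the fixed compaction interval of 64 are my own illustrative choices; the answer above deliberately leaves X and Y open):

```php
<?php
// Sketch: zero out a chosen player's weight instead of unsetting the key,
// and compact the array with array_filter() every $compactEvery picks.
// The interval of 64 is an arbitrary placeholder, not a tuned value.
function shuffleWithZeroing(array $order, int $compactEvery = 64): array
{
    $result = [];
    $picks  = 0;
    $sum    = array_sum($order);       // running total of the remaining balls
    while ($sum > 0) {
        $throw = rand(0, $sum - 1);
        foreach ($order as $key => $value) {
            if ($throw < $value) {     // zeroed entries never match ($throw >= 0)
                $result[] = $key;
                $sum -= $value;
                $order[$key] = 0;      // neuter instead of unset
                break;
            }
            $throw -= $value;
        }
        if ((++$picks % $compactEvery) === 0) {
            $order = array_filter($order);  // drop the zeroed entries in one pass
        }
    }
    return $result;
}

print_r(shuffleWithZeroing(['John' => 3, 'Peter' => 2, 'Paul' => 1]));
```

Keeping a running $sum avoids re-summing the array on every pick; correctness is unchanged because a zeroed entry can never satisfy $throw < $value.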

To gauge results, run a certain number of extractions and check the actual scores:

    srand(time());

    $results = array_fill_keys(array_keys($order), 0.001);

    for ($i = 0; $i < 1000000; $i++) {
        $arr = shuffleWithProbability($order);
        // Who came first?
        $results[$arr[0]]++;
    }
    $least   = min(array_values($results));
    $results = array_map(static fn($v) => round($v / $least, 2), $results);
    arsort($results);
    foreach ($order as $name => $expected) {
        $actual = $results[$name];
        print("{$name} expected {$expected}, got {$actual}\n");
    }

(The output display code assumes the least probable name always has a weight of 1)

With five names, these results are achieved in about 2.1s on a Core i9:

    Zero expected 20, got 20
    John expected 10, got 10
    Luke expected 5, got 4.98
    Paul expected 2, got 2
    Matt expected 1, got 1

Using an array of 1000 names, one iteration takes on average 13 milliseconds.

With 500 names, 3.5 milliseconds (about 1/4th of the time), and with 2000 names, as expected, about 53 milliseconds (4x the time). The algorithm therefore appears to behave, as expected, as O(N^2).

Random, idle musing

I will explore the approach below some time in the coming days, but I was wondering: how do things behave if I assign each player a random score based on the player's desired odds and the overall range, and then sort the players once on that new, static score?

In other words, I start with one W value for each of three players:

Paul = 10, John = 5, Luke = 1

The overall range is 16. I assign to each player a score F(W). For example (I don't think this particular one would work; the question is whether some other F() might. F(x) = the highest of x random extractions? A power function of x/range with phi?): F(x) = rand()/getrandmax()*x. So Paul extracts 0.21 and gets 2.1, John extracts 0.7 and gets 3.5, Luke extracts 0.11 and gets 0.11. The array, built in O(N), is [ Paul: 2.1, John: 3.5, Luke: 0.11 ]. I then quicksort it in O(N log N), getting [ John, Paul, Luke ]. How often does each player end up in first place?
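The musing can be sketched like this (the function name shuffleByStaticScore is mine, and the F used is the rand()/getrandmax()*x candidate from the paragraph above, which that paragraph itself doubts; this shows only the O(N log N) shape, not a validated distribution):

```php
<?php
// Sketch of the "static score" idea: one random draw per player, scaled by
// the player's weight, then a single sort. No claim is made here that this
// particular F reproduces the urn model's first-place odds.
function shuffleByStaticScore(array $order): array
{
    $scores = [];
    foreach ($order as $name => $weight) {   // O(N) pass
        $scores[$name] = (rand() / getrandmax()) * $weight;
    }
    arsort($scores);                         // O(N log N), highest score first
    return array_keys($scores);
}

print_r(shuffleByStaticScore(['Paul' => 10, 'John' => 5, 'Luke' => 1]));
```

(As an aside, one F() that is known to make exactly this scheme match the urn probabilities is u^(1/w) with u uniform in (0,1), the key used in Efraimidis and Spirakis' weighted random sampling; with that key, sorting descending puts each player first with probability proportional to its weight.)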

Upvotes: 2

Luuk

Reputation: 14978

Converting the table in your question:

Person P1  P2  P3
John   3/6 2/6 1/6
Peter  2/6 2/6 2/6
Paul   1/6 2/6 3/6

to SQL:

select 
   value, 
   Probability,
   @r:=rand(),
   case when @r < Probability/6 then 1 
        when @r < (Probability/6 + 2/6) then 2 
        else 3 end as P
from mytable
order by P;

Or, doing this several times and counting how often a certain position is achieved:

with recursive 
nrs as (
   select 1 as x
   union  all
   select x+1 from nrs where x<1000
)
select 
   value, 
   Probability, 
   count(CASE WHEN P=1 THEN 1 END) as P1,
   count(CASE WHEN P=2 THEN 1 END) as P2,
   count(CASE WHEN P=3 THEN 1 END) as P3
from (
    select 
       value, 
       Probability,
       @r:=rand(),
       case when @r < Probability/6 then 1 
            when @r < (Probability/6 + 2/6) then 2 
            else 3 end as P
    from mytable,nrs
) y 
group by value,Probability
order by Probability desc;

see: DBFIDDLE (not the previously posted https://dbfiddle.uk/0t10GD9S)

Upvotes: 1

Markus Zeller

Reputation: 9153

When you have a large array and want weighted random values, you could create a new array in which higher-weighted values occur more often, but this will consume a lot of extra memory.

Instead, starting from a budget of 1, you can repeatedly pick a random entity and subtract its normalized weight until the budget goes negative, then return the last entity picked. Heavier entries then come up more often.

Here we draw 1,000,000 samples; John gets the largest share, then Peter, then Paul.

// Normalized weighting (all together sum up to 1)
$sum = 3+2+1;
$array = [
    'John'  => 3 / $sum,
    'Peter' => 2 / $sum,
    'Paul'  => 1 / $sum,
];

$keys = array_keys($array);
$samples = 1e6;
$pool = array_fill_keys($keys, 0);
$len = count($array) - 1;

for($i = 0; $i < $samples; $i++) {
    $weight = 1;
    while($weight > 0) {
        $key = $keys[rand(0, $len)];
        $weight -= $array[$key];
    }
    $pool[$key]++;
}

Dump of the pool:

array(3) {
  ["John"] => int(588533)
  ["Peter"]=> int(289020)
  ["Paul"] => int(122447)
}

Upvotes: 0
