Reputation: 569
I have an array with values and probabilities (or in PHP, keys and values):
Value Probability
John 3
Peter 2
Paul 1
I want to shuffle this array but have the order be influenced by the probability. That is, on average John should appear at the top of the list half the time, Peter a third of the time, and Paul one sixth of the time, etc.* The average order over thousands of shuffles should be the one given above.
I have thought to:
1. Create an array in which each value appears as often as indicated by the probability:
John
John
John
Peter
Peter
Paul
2. Shuffle this array, e.g.:
John
Paul
Peter
John
John
Peter
3. Remove the duplicates, leaving the first instance:
John
Paul
Peter
In PHP code this might look something like this:
$items = array(
    'John'  => 3,
    'Peter' => 2,
    'Paul'  => 1
);
$probabilities = array();
foreach ($items as $item => $value) {
    $interim = array_fill(1, $value, $item);
    $probabilities = array_merge($probabilities, $interim);
}
shuffle($probabilities);
$result = array_unique($probabilities);
print("<pre>" . print_r($result, true) . "</pre>");
Is there a faster way to do this?
The real array contains a few hundred values and this probability-based shuffling has to be done every time a web page is requested, so it needs to be lightning quick.
* Here are the probabilities for each position for each person:
Person P1 P2 P3
John 3/6 2/6 1/6
Peter 2/6 2/6 2/6
Paul 1/6 2/6 3/6
Note that the probabilities for the second and third rank follow from the probability for the first rank and are not explicitly given in the original data. They are just presented here for your convenience, because someone asked about them in the comments below.
The real array contains a few hundred values:
John 3 ← line 1
Peter 2
Paul 1
Simon 6
Robert 2
William 1
...
Sean 1 ← line 786
Upvotes: 0
Views: 115
Reputation: 57453
Sure, your original approach works. You do not need, however, to create large arrays.
The urn behaviour can be implemented like this (note: for smaller arrays I suspect your approach might be faster, and the fact it's not memory efficient could well be irrelevant. I haven't checked where the tipping point is):
function extractWithProbability(array $order): string {
    // Only one player left: no need to draw.
    if (1 === count($order)) {
        return array_key_first($order);
    }
    // Total number of balls in the virtual urn.
    $sum = array_sum($order);
    // Pick a random ball position, from 0 to the number of balls minus one...
    $throw = rand(0, $sum - 1);
    // ...and find out whose ball it is.
    foreach ($order as $key => $value) {
        if ($throw < $value) {
            return $key;
        }
        $throw -= $value;
    }
    throw new RuntimeException('Unexpected');
}
First we get the total number of balls in our virtual urn. Each player contributes as many balls as his weight, so John gets three, Peter two and Paul one. Then we imagine the balls lined up one after another and draw a random position, from 0 to the number of balls minus one, to decide which ball to pick. This takes care of the probability.
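For example, a single draw from the original weights (just a usage illustration):
$order = ['John' => 3, 'Peter' => 2, 'Paul' => 1];
// Prints 'John' about half the time, 'Peter' a third of the time, 'Paul' a sixth of the time.
echo extractWithProbability($order), PHP_EOL;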
Then we build the array.
We could do this with uasort and a custom sort function, but that would involve knowing that PHP internally uses quicksort, plus a whole lot of other assumptions. We could get a very approximate probability function for the compare callback, but I don't think it would be worth the pain, and it could give occasional pathological results.
So, we simply "remove" all of a player's balls whenever he gets chosen, and repeat the extraction with the remaining players:
function shuffleWithProbability(array $order): array {
    $result = [];
    while (!empty($order)) {
        $chosen = extractWithProbability($order);
        $result[] = $chosen;
        unset($order[$chosen]);
    }
    return $result;
}
On some hardware and with small arrays, it might be more convenient to neuter $order[$chosen] by setting its value to 0 instead of unsetting it. The array is then never reorganised internally, which is a gain, but it is scanned in its entirety on every loop iteration, which is a loss (N^2 scans against 1/2 N^2 on average). For large N this is not convenient. We could try to offset the effect by running $order through array_filter() every so often (instead of unset we would set the value to 0 and, say, every K iterations strip out the zeroed entries), but the right K would vary with hardware, kernel configuration and PHP version - a maintenance nightmare.
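For illustration, here is a minimal sketch of that variant (the function name and the $filterEvery parameter are invented for this example; it reuses extractWithProbability() from above):
function shuffleWithProbabilityZeroing(array $order, int $filterEvery = 64): array {
    $result = [];
    $total = count($order);
    for ($i = 0; $i < $total; $i++) {
        $chosen = extractWithProbability($order);
        $result[] = $chosen;
        $order[$chosen] = 0;               // neuter instead of unset
        if (($i + 1) % $filterEvery === 0) {
            $order = array_filter($order); // periodically drop the zeroed entries
        }
    }
    return $result;
}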
To gauge results, run a certain number of extractions and check the actual scores:
srand(time());
$results = array_fill_keys(array_keys($order), 0.001);
for ($i = 0; $i < 1000000; $i++) {
    $arr = shuffleWithProbability($order);
    // Who came first?
    $results[$arr[0]]++;
}
$least = min(array_values($results));
$results = array_map(static fn($v) => round($v / $least, 2), $results);
arsort($results);
foreach ($order as $name => $expected) {
    $actual = $results[$name];
    print("{$name} expected {$expected}, got {$actual}\n");
}
(The output display code assumes the least probable name always has a weight of 1)
With five names, these results are achieved in about 2.1s on a Core i9:
Zero expected 20, got 20
John expected 10, got 10
Luke expected 5, got 4.98
Paul expected 2, got 2
Matt expected 1, got 1
Using an array of 1000 names, one iteration takes on average 13 milliseconds.
With 500 names it takes 3.5 milliseconds (about a quarter of the time), and with 2000 names about 53 milliseconds (four times as much). The algorithm therefore appears to behave as O(N^2), as expected.
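To reproduce such measurements on your own hardware, a rough harness along these lines should do (the names and weights are made up; it reuses shuffleWithProbability() from above):
// Build a dummy weight table of $n names with random weights from 1 to 10.
$n = 1000;
$order = [];
for ($i = 0; $i < $n; $i++) {
    $order["name{$i}"] = rand(1, 10);
}
$iterations = 100;
$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    shuffleWithProbability($order);
}
$elapsedMs = (hrtime(true) - $start) / 1e6 / $iterations;
printf("%d names: %.2f ms per shuffle\n", $n, $elapsedMs);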
I will explore the approach below some time in the next few days, but I was wondering - how do things behave if I assign each player a single random weight, based on the player's desired odds and the overall range, and then sort the players by that new, static weight?
In other words, I start with a weight W for each player:
Paul = 10, John = 5, Luke = 1
The overall range is 16. I then assign each player a value F(W). I don't think the simple F(x) = rand()/getrandmax()*x would work - the question is whether some other F() might (the highest of x random extractions? a power function of x/range with phi?). With the simple F, say Paul extracts 0.21 and gets 2.1, John extracts 0.7 and gets 3.5, Luke extracts 0.11 and gets 0.11. The processed array, built in O(N), is [ Paul: 2.1, John: 3.5, Luke: 0.11 ]. I sort it in O(N log N) and get [ John, Paul, Luke ]. How often does each player end up in first place?
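A quick way to answer that last question empirically would be something along these lines (just a sketch of the experiment described above, using the simple F(x) = rand()/getrandmax()*x candidate):
$weights = ['Paul' => 10, 'John' => 5, 'Luke' => 1];
$firsts = array_fill_keys(array_keys($weights), 0);
for ($i = 0; $i < 1000000; $i++) {
    // Assign each player a static random key F(w) = U * w, with U uniform in [0, 1].
    $keys = [];
    foreach ($weights as $name => $w) {
        $keys[$name] = rand() / getrandmax() * $w;
    }
    arsort($keys);                      // highest key first
    $firsts[array_key_first($keys)]++;  // tally who came first this round
}
print_r($firsts);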
Upvotes: 2
Reputation: 14978
Converting the table in your question:
Person P1 P2 P3
John 3/6 2/6 1/6
Peter 2/6 2/6 2/6
Paul 1/6 2/6 3/6
to SQL:
select
    value,
    Probability,
    @r := rand(),
    case when @r < Probability/6 then 1
         when @r < (Probability/6 + 2/6) then 2
         else 3 end as P
from mytable
order by P;
Or, doing this several times and counting how often each position is achieved:
with recursive
nrs as (
    select 1 as x
    union all
    select x+1 from nrs where x < 1000
)
select
    value,
    Probability,
    count(CASE WHEN P=1 THEN 1 END) as P1,
    count(CASE WHEN P=2 THEN 1 END) as P2,
    count(CASE WHEN P=3 THEN 1 END) as P3
from (
    select
        value,
        Probability,
        @r := rand(),
        case when @r < Probability/6 then 1
             when @r < (Probability/6 + 2/6) then 2
             else 3 end as P
    from mytable, nrs
) y
group by value, Probability
order by Probability desc;
See: DBFIDDLE (not the previously posted one: https://dbfiddle.uk/0t10GD9S)
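If you want to run the first query from PHP, a sketch along these lines could work (assuming a PDO connection to MySQL and that mytable has the columns value and Probability as above; the connection details are placeholders):
// Placeholder connection details - adjust to your environment.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');
$sql = "
    select value,
           Probability,
           @r := rand() as r,
           case when @r < Probability/6 then 1
                when @r < (Probability/6 + 2/6) then 2
                else 3 end as P
    from mytable
    order by P";
$rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
foreach ($rows as $row) {
    echo $row['value'], ' => position ', $row['P'], PHP_EOL;
}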
Upvotes: 1
Reputation: 9153
When you have a large array and want weighted random values, you could create a new array in which higher-weighted values occur more often, but that consumes a lot of extra memory.
Instead, starting from the total weight, you can keep subtracting the weight of a randomly picked entity until the total goes negative, and return that entity. This preserves the weighting.
Here we draw 1,000,000 samples, so John gets the most, then Peter, then Paul.
// Normalized weighting (all together sum up to 1)
$sum = 3 + 2 + 1;
$array = [
    'John'  => 3 / $sum,
    'Peter' => 2 / $sum,
    'Paul'  => 1 / $sum,
];
$keys = array_keys($array);
$samples = 1e6;
$pool = array_fill_keys($keys, 0);
$len = count($array) - 1;
for ($i = 0; $i < $samples; $i++) {
    // Keep drawing uniform random keys and subtracting their normalized
    // weights from 1 until the running total drops below zero...
    $weight = 1;
    while ($weight > 0) {
        $key = $keys[rand(0, $len)];
        $weight -= $array[$key];
    }
    // ...and credit the key that made it go negative.
    $pool[$key]++;
}
Dump of the pool:
array(3) {
["John"] => int(588533)
["Peter"]=> int(289020)
["Paul"] => int(122447)
}
Upvotes: 0