Reputation: 582
I have a set of numbers for a given set of attributes:
red = 4
blue = 0
orange = 2
purple = 1
I need to calculate the distribution percentage. Meaning, how diverse is the selection? Is it 20% diverse? Is it 100% diverse (meaning an even distribution of say 4,4,4,4)?
I'm trying to create a sexy percentage that approaches 100% as the individual values get closer to being equal, and drops lower the more lopsided they get.
Has anyone done this?
Here is my PHP conversion of the Ruby example below. For some reason it's not producing 1.0 with a 4,4,4,4 example.
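$arrayChoices = array(4,4,4,4);
// Initialize the accumulator so the first += does not read an undefined variable.
$sum = 0;
foreach($arrayChoices as $p)
    $sum += $p;
print "sum: ".$sum."<br>";
// Convert the counts to proportions.
$pArray = array();
foreach($arrayChoices as $rec)
{
    print "p vector value: ".$rec." ".($rec / $sum)."\n<br>";
    array_push($pArray,$rec / $sum);
}
// Accumulate the entropy; zero proportions contribute nothing.
$total = 0;
foreach($pArray as $p)
    if($p > 0)
        $total = $total - $p*log($p,2);
print "total = $total <br>";
// Normalize by the maximum entropy log2(count), then scale to a percentage.
print round($total / log(count($pArray),2) *100);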
Thanks in advance!
Upvotes: 1
Views: 417
Reputation: 19855
One possibility would be to base your measure on entropy. The uniform distribution has maximum entropy, so you could create a measure as follows:
1) Convert your vector of counts to P, a vector of proportions (probabilities).
2) Calculate the entropy function H(P) for your vector of probabilities P.
3) Calculate the entropy function H(U) for a vector of equal probabilities which has the same length as P. (This turns out to be H(U) = -log(1.0 / length(P)), so you don't actually need to create U as a vector.)
4) Your diversity measure would be 100 * H(P) / H(U).
Any set of equal counts yields a diversity of 100. When I applied this to your (4, 0, 2, 1) case, the diversity was 68.94. Any vector with all but one element having counts of 0 has diversity 0.
ADDENDUM
Now with source code! I implemented this in Ruby.
def relative_entropy(v)
  # Sum all the values in the vector v, convert to decimal
  # so we won't have integer division below...
  sum = v.inject(:+).to_f
  # Divide each value in v by sum, store in new array p
  pvals = v.map{|value| value / sum}
  # Build a running total by calculating the entropy contribution for
  # each p. Entropy is zero if p is zero, in which case total is unchanged.
  # Finally, scale by the entropy equivalent of all proportions being equal.
  pvals.inject(0){|total,p| p > 0 ? (total - p*Math.log2(p)) : total} / Math.log2(pvals.length)
end
# Scale these by 100 to turn into a percentage-like measure
relative_entropy([4,4,4,4]) # => 1.0
relative_entropy([4,0,2,1]) # => 0.6893917467430877
relative_entropy([16,0,0,0]) # => 0.0
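If you want the percentage directly, here is a minimal sketch of a wrapper along the same lines. The name diversity_percent is my own invention, and returning 0 for an empty vector, an all-zero vector, or a single category is an assumption on my part (with one category the scaling term log2(1) is zero, so the ratio is undefined).

```ruby
# Hypothetical helper, not part of the original answer: normalized entropy
# expressed as a percentage between 0 and 100.
def diversity_percent(counts)
  total = counts.sum.to_f
  # Assumption: with fewer than two categories, or no observations at all,
  # the ratio is undefined, so report 0 instead of NaN.
  return 0.0 if counts.length < 2 || total.zero?

  # Shannon entropy in bits; zero counts contribute nothing.
  entropy = counts.reduce(0.0) do |acc, count|
    prob = count / total
    prob > 0 ? acc - prob * Math.log2(prob) : acc
  end

  # Divide by the entropy of the uniform distribution, log2(number of categories).
  100.0 * entropy / Math.log2(counts.length)
end

puts diversity_percent([4, 4, 4, 4]).round(2)   # prints 100.0
puts diversity_percent([4, 0, 2, 1]).round(2)   # prints 68.94
puts diversity_percent([16, 0, 0, 0]).round(2)  # prints 0.0
```

Rounding is left to the caller so the function itself keeps full precision.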
Upvotes: 1
Reputation: 1637
A simple, if rather naive, scheme is to sum the absolute differences between your observations and a perfectly uniform distribution:
red = abs(4 - 7/4) = 9/4
blue = abs(0 - 7/4) = 7/4
orange = abs(2 - 7/4) = 1/4
purple = abs(1 - 7/4) = 3/4
for a total of 5.
A perfectly even spread will have a score of zero which you must map to 100%.
Assuming you have n items in c categories, a perfectly uneven spread will have a score of
(c-1)*n/c + 1*(n-n/c) = 2*(n-n/c)
which you should map to 0%. For a score d, you might use the linear transformation
100% * (1 - d / (2*(n-n/c)))
For your example this would result in
100% * (1 - 5 / (2*(7-7/4))) = 100% * (1 - 10/21) ~ 52%
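The arithmetic above can be sketched in Ruby roughly as follows. The function name deviation_percent is hypothetical, and returning 0 when there are fewer than two categories or no items is my own assumption, since the worst-case score 2*(n-n/c) is zero in those cases.

```ruby
# Hypothetical helper: evenness percentage from summed absolute deviations.
def deviation_percent(counts)
  n = counts.sum.to_f   # total number of items
  c = counts.length     # number of categories
  # Assumption: the worst-case score is zero here, so report 0 rather than divide by zero.
  return 0.0 if c < 2 || n.zero?

  expected = n / c                                   # count per category if perfectly even
  d = counts.sum { |count| (count - expected).abs }  # observed score
  worst = 2 * (n - n / c)                            # score when everything is in one category
  100.0 * (1 - d / worst)
end

puts deviation_percent([4, 0, 2, 1]).round(2)   # prints 52.38
puts deviation_percent([4, 4, 4, 4]).round(2)   # prints 100.0
puts deviation_percent([16, 0, 0, 0]).round(2)  # prints 0.0
```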
Better yet (although more complicated) is the Kolmogorov–Smirnov statistic with which you can make mathematically rigorous statements about the probability that a set of observations have some given underlying probability distribution.
Upvotes: 2