Colin
Colin

Reputation: 415

How to determine which row in dataframe has most even and highest distribution

I would like to sort a pandas dataframe by the rows which have the most even distribution but also high values. For example:

Row  Attribute1  Attribute2 Attribute3
a        1           1          108
b        10          2          145
c        50          60         55
d        100         90         120
e        20          25         23
f        1000        30         0

Rows d and c should rank the highest, ideally d followed by c.

I considered using standard deviation to identify the most even distribution and then mean to get the highest average values but I'm unsure as to how I can combine these together.

Upvotes: 0

Views: 155

Answers (1)

Hugolmn
Hugolmn

Reputation: 1560

As the perception of "even distribution" you mention seems to be quite subjective, here is an instuction to implement the coefficient of variation mentionned by @ALollz.

df.std(axis=1) / df.mean(axis=1)

Row 0
a   1.6848130582715446
b   1.535375387727906
c   0.09090909090909091
d   0.14782502241793033
e   0.11102697698927574
f   1.6569547684031352

This metrics is the percentage of the mean represented by the standard deviation. If you have a row mean of 10 and a standard deviation of 1, the ratio will be 10% or 0.1

In this example, the row that could be considered most 'evenly distributed' is the row c: its mean is 55 and its standard deviation is 5. Therefore the ratio is about 9%.

This way, you can have a decent overview of the homogeneity of the distribution.

If you want the ranking, you can apply .sort_values:

(df.std(axis=1) / df.mean(axis=1)).sort_values()
Row 0
c   0.09090909090909091
e   0.11102697698927574
d   0.14782502241793033
b   1.535375387727906
f   1.6569547684031352
a   1.6848130582715446

My last words would be to not be fooled by our brain's perception: it can be easily tricked by statistics.

Now if you want to improve results of higher values, you can divide this coefficient by the mean: the higher the mean, the lower the coefficient.

(df.std(axis=1) / df.mean(axis=1)**2).sort_values()

Row 0
d   0.0014305647330767452
c   0.001652892561983471
f   0.004826081849717869
e   0.004898248984820989
b   0.029338383204991835
a   0.045949447043769395

And now we obtain the desired ranking : d first, then c, f, e, b and a

Upvotes: 1

Related Questions