Kishan

Reputation: 362

How is rank calculated in pandas?

I am confused about how the rank of a Series is calculated. As I understand it, rank is calculated from the highest value to the lowest value in a Series, and if two numbers are equal, pandas takes the average of the numbers.

In this example, the highest value is 7. Why do we get rank 5.5 for the number 7 and rank 1.5 for the number 4?

S1 = pd.Series([7,6,7,5,4,4])
S1.rank()

Output:

0    5.5
1    4.0
2    5.5
3    3.0
4    1.5
5    1.5
dtype: float64

Upvotes: 4

Views: 2321

Answers (3)

Omar

Reputation: 1111

You were using the default ranking method. If you want the max rank, do the following:

S1 = pd.Series([7,6,7,5,4,4])
S1.rank(method='max')

Here are all the ranking methods supported by pandas:

method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'

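# Collect the ranks as DataFrame columns; assigning them to S1['...'] would
# add new elements to the Series instead of new columns.
df = pd.DataFrame({'value': S1})
df['default_rank'] = df['value'].rank()
df['max_rank'] = df['value'].rank(method='max')
df['NA_bottom'] = df['value'].rank(na_option='bottom')
df['pct_rank'] = df['value'].rank(pct=True)
print(df)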
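To see how the five method values listed above differ on the Series from the question, here is a small sketch that puts them side by side. The names methods and comparison are just illustrative.

```python
import pandas as pd

S1 = pd.Series([7, 6, 7, 5, 4, 4])

# One column per tie-breaking method, so the differences are easy to compare.
methods = ['average', 'min', 'max', 'first', 'dense']
comparison = pd.DataFrame({m: S1.rank(method=m) for m in methods})

# Put the original values in the first column for reference.
comparison.insert(0, 'value', S1)
print(comparison)
```

Only the tied values (7 and 4) change between average, min, max and first; dense additionally removes the gaps that ties leave in the ranking.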

Upvotes: 2

Quang Hoang

Reputation: 150765

As commented by Joachim, the rank function accepts an argument method, which defaults to 'average'. That is, values that are equal all receive the average of the ranks they would otherwise occupy.

Per the documentation, the options for method are:

method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
    How to rank the group of records that have the same value (i.e. ties):

  • average: average rank of the group
  • min: lowest rank in the group
  • max: highest rank in the group
  • first: ranks assigned in order they appear in the array
  • dense: like 'min', but rank always increases by 1 between groups

For example, with method='dense', S1.rank(method='dense') gives:

0    4.0
1    3.0
2    4.0
3    2.0
4    1.0
5    1.0
dtype: float64

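which is roughly what pd.factorize(S1, sort=True) produces, except that factorize returns integer codes starting at 0 rather than float ranks starting at 1.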
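A quick way to check the relationship between dense ranks and factorize on this Series. This is only a sketch for this example; sort=True is needed so the codes follow the sorted order of the unique values rather than their order of appearance.

```python
import pandas as pd

S1 = pd.Series([7, 6, 7, 5, 4, 4])

dense = S1.rank(method='dense')

# sort=True makes the integer codes follow the sorted unique values.
codes, uniques = pd.factorize(S1, sort=True)

print(codes)    # zero-based integer codes
print(uniques)  # sorted unique values

# Dense ranks are floats starting at 1; factorize codes are integers starting at 0.
assert (dense.to_numpy() == codes + 1).all()
```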


Update: per your question, let's try writing a function that behaves similarly to S1.rank():

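import numpy as np
import pandas as pd

def my_rank(s):
    # sort s by values; mergesort is stable, so ties keep their order of appearance
    s_sorted = s.sort_values(kind='mergesort')

    # these are the incremental ranks 1..n,
    # equivalent to s.rank(method='first')
    ranks = pd.Series(np.arange(len(s_sorted)) + 1, index=s_sorted.index)

    # average the ranks within each group of equal values
    avg_ranks = ranks.groupby(s_sorted).transform('mean')

    # the result keeps the original index labels, in sorted-by-value order
    return avg_ranks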
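A quick check of my_rank against the built-in on the Series from the question. The function is repeated here only so the snippet runs on its own; the comparison is a sketch for this one input, not a general proof of equivalence.

```python
import numpy as np
import pandas as pd

def my_rank(s):
    # Same logic as the function above, repeated so this snippet is self-contained.
    s_sorted = s.sort_values(kind='mergesort')
    ranks = pd.Series(np.arange(len(s_sorted)) + 1, index=s_sorted.index)
    return ranks.groupby(s_sorted).transform('mean')

S1 = pd.Series([7, 6, 7, 5, 4, 4])

# my_rank returns rows in sorted-by-value order;
# sort_index() puts them back in the original order before comparing.
result = my_rank(S1).sort_index()
print(result)

assert result.tolist() == S1.rank().tolist()
```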

Upvotes: 2

vvk24

Reputation: 490

The rank is calculated as follows:

  1. Arrange the elements in ascending order and assign ranks starting with 1 for the lowest element.
Elements - 4, 4, 5, 6, 7, 7
Ranks    - 1, 2, 3, 4, 5, 6
  2. Now consider the repeated items: average their corresponding ranks and assign the averaged rank to each of them.

Since 4 appears twice, the final rank of each occurrence is the average of 1 and 2, which is 1.5. In the same way, for 7 the final rank of each occurrence is the average of 5 and 6, which is 5.5.

Elements -   4,   4,   5, 6, 7,   7
Ranks    -   1,   2,   3, 4, 5,   6
Final Rank - 1.5, 1.5, 3, 4, 5.5, 5.5
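The two steps above can be sketched in plain Python and compared with pandas. The names order, provisional and final_rank are only illustrative; this mirrors the manual procedure and is not how pandas implements rank internally.

```python
import pandas as pd

values = [7, 6, 7, 5, 4, 4]

# Step 1: sort positions by value (ascending) and hand out ranks 1..n.
order = sorted(range(len(values)), key=lambda i: values[i])
provisional = {pos: rank for rank, pos in enumerate(order, start=1)}

# Step 2: average the provisional ranks of equal values.
final_rank = []
for v in values:
    tied = [provisional[i] for i, w in enumerate(values) if w == v]
    final_rank.append(sum(tied) / len(tied))

print(final_rank)  # [5.5, 4.0, 5.5, 3.0, 1.5, 1.5]

# Same result as the pandas default (method='average', ascending=True).
assert final_rank == pd.Series(values).rank().tolist()

# With ascending=False the ranking starts from the highest value instead.
descending = pd.Series(values).rank(ascending=False).tolist()
print(descending)  # [1.5, 3.0, 1.5, 4.0, 5.5, 5.5]
```

So the 7s get 5.5 because ranking is ascending by default; passing ascending=False gives them the average of ranks 1 and 2 instead.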

Upvotes: 5
