merlin2011

Reputation: 75575

How do I get the index of a specific percentile in numpy / scipy?

I have looked at this answer, which explains how to compute the value of a specific percentile, and this answer, which explains how to compute the percentiles that correspond to each element.

However, both require an additional scan if I want to know the index (in the original array) that corresponds to a particular percentile (or the index of the element closest to that percentile).

Is there a more direct or built-in way to get the index which corresponds to a percentile?

Note: My array is not sorted and I want the index in the original, unsorted array.

Upvotes: 21

Views: 20257

Answers (6)

ClimateUnboxed

Reputation: 8087

If you are using numpy, you can use the built-in percentile function, but how you call it depends on the version you have (very old: < 1.9.0, old: < 1.22, or new: >= 1.22).

From v1.22.0 of numpy you can write

np.percentile(x,p,method="method") 

with method chosen from:

  • ‘inverted_cdf’

  • ‘averaged_inverted_cdf’

  • ‘closest_observation’

  • ‘interpolated_inverted_cdf’

  • ‘hazen’

  • ‘weibull’

  • ‘linear’ (default)

  • ‘median_unbiased’

  • ‘normal_unbiased’
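As a rough sketch of how the index lookup could be written so it runs on both sides of the v1.22 rename, the helper below tries the method keyword first and falls back to interpolation. The function name nearest_percentile_index is made up for this illustration, and it uses 'nearest', one of the legacy options that newer versions still accept.

```python
import numpy as np

def nearest_percentile_index(x, p):
    """Index in x of the element nearest to the p-th percentile (hypothetical helper)."""
    x = np.asarray(x)
    try:
        # NumPy >= 1.22 spells the keyword "method"
        value = np.percentile(x, p, method="nearest")
    except TypeError:
        # Older NumPy (>= 1.9) only knows "interpolation"
        value = np.percentile(x, p, interpolation="nearest")
    # 'nearest' returns an actual array element, so the closest entry is that element
    return int(np.abs(x - value).argmin())

x = np.array([12, 19, 11, 28, 10])
i = nearest_percentile_index(x, 75)
print(i, x[i])
```

Note that this still scans the array a second time to locate the value; it only hides the version difference.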

For older versions before v1.22

NOTE: The original answer below is deprecated from numpy v1.22.0 onwards. The argument interpolation is now deprecated and has been renamed method. The lower, higher and nearest options are retained for backwards compatibility as discontinuous variations of the default linear method. New methods have also been added; see the man page for details.

From version 1.9.0 of numpy, percentile has the option "interpolation" that allows you to pick out the lower/higher/nearest percentile value. The following will work on unsorted arrays and finds the nearest percentile index:

import numpy as np
p=70 # my desired percentile, here 70% 
x=np.random.uniform(1,10,size=1000)-5.0  # dummy vector, values in [-4, 5)

# index of array entry nearest to percentile value
pcen=np.percentile(x,p,interpolation='nearest')
i_near=abs(x-pcen).argmin()

Most people will want the nearest percentile value, as shown above. For completeness, you can also request the entry just below or just above the percentile value:

# Array entry equal to or just above the percentile, and its index:
pcen=np.percentile(x,p,interpolation='higher')
i_high=abs(x-pcen).argmin()

# Array entry equal to or just below the percentile, and its index:
pcen=np.percentile(x,p,interpolation='lower')
i_low=abs(x-pcen).argmin()

For EXTREMELY OLD versions of numpy < v1.9.0, the interpolation option is not available, and thus the equivalent is this:

# Calculate 70th percentile:
pcen=np.percentile(x,p)
i_high=np.asarray([i-pcen if i-pcen>=0 else x.max()-pcen for i in x]).argmin()
i_low=np.asarray([i-pcen if i-pcen<=0 else x.min()-pcen for i in x]).argmax()
i_near=abs(x-pcen).argmin()

In summary:

i_high points to the array entry which is the next value equal to, or greater than, the requested percentile.

i_low points to the array entry which is the next value equal to, or smaller than, the requested percentile.

i_near points to the array entry that is closest to the percentile, and can be larger or smaller.

My results are:

pcen

2.3436832738049946

x[i_high]

2.3523077864975441

x[i_low]

2.339987054079617

x[i_near]

2.339987054079617

i_high,i_low,i_near

(876, 368, 368)

i.e. location 876 holds the closest value exceeding pcen, but location 368 holds a value that is even closer, though slightly smaller than the percentile value.
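The three indices can also be computed without a Python-level loop. This is only a sketch on a small made-up array, using np.where to mask out the entries on the wrong side of the percentile; it needs no interpolation or method keyword, so it should behave the same on any numpy version.

```python
import numpy as np

x = np.array([12.0, 19.0, 11.0, 28.0, 10.0])  # small example array (assumed data)
pcen = np.percentile(x, 70)                    # default linear interpolation
diff = x - pcen

# Entries below the percentile are masked with +inf so argmin ignores them
i_high = int(np.where(diff >= 0, diff, np.inf).argmin())
# Entries above the percentile are masked with -inf so argmax ignores them
i_low = int(np.where(diff <= 0, diff, -np.inf).argmax())
# Closest entry on either side
i_near = int(np.abs(diff).argmin())

print(i_high, i_low, i_near)
```

Here the 70th percentile falls between 12 and 19, so i_high points at 19 and i_low at 12.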

Upvotes: 6

runDOSrun

Reputation: 10995

You can use numpy's np.percentile like this:

import random
import numpy as np

percentile = 75
mylist = [random.random() for i in range(100)] # random list

# 'nearest' returns an actual list element, so list.index can find it
percidx = mylist.index(np.percentile(mylist, percentile, interpolation='nearest'))

Upvotes: 3

sharma0611

Reputation: 57

Using numpy,

import numpy as np

arr = [12, 19, 11, 28, 10]
p = 0.75
np.argsort(arr)[int((len(arr) - 1) * p)]

This returns 1, the index of the value 19, which is the 75th percentile of arr.

Upvotes: 3

metaditch

Reputation: 63

With pandas, you can select the values of a DataFrame column that lie at or above a given quantile using quantile(). The resulting Series keeps the original index labels.

df_metric_95th_percentile = df.metric[df.metric >= df['metric'].quantile(q=0.95)]

Upvotes: 1

Jaime

Reputation: 67427

It is a little convoluted, but you can get what you are after with np.argpartition. Let's take an easy array and shuffle it:

>>> a = np.arange(10)
>>> np.random.shuffle(a)
>>> a
array([5, 6, 4, 9, 2, 1, 3, 0, 7, 8])

If you want to find e.g. the index of quantile 0.25, this would correspond to the item in position idx of the sorted array:

>>> idx = 0.25 * (len(a) - 1)
>>> idx
2.25

You need to figure out how to round that to an int, say you go with nearest integer:

>>> idx = int(idx + 0.5)
>>> idx
2

If you now call np.argpartition, this is what you get:

>>> np.argpartition(a, idx)
array([7, 5, 4, 3, 2, 1, 6, 0, 8, 9], dtype=int64)
>>> np.argpartition(a, idx)[idx]
4
>>> a[np.argpartition(a, idx)[idx]]
2

It is easy to check that these last two expressions are, respectively, the index and the value of the .25 quantile.
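The steps above could be wrapped into a small function, sketched here with the same nearest-integer rounding. The name quantile_index is invented for this example and is not part of numpy.

```python
import numpy as np

def quantile_index(a, q):
    """Index in the unsorted array a of the element at quantile q (hypothetical helper)."""
    a = np.asarray(a)
    # Position in the sorted array, rounded to the nearest integer
    k = int(q * (len(a) - 1) + 0.5)
    # argpartition places the index of the k-th smallest element at position k
    return int(np.argpartition(a, k)[k])

a = np.array([5, 6, 4, 9, 2, 1, 3, 0, 7, 8])
i = quantile_index(a, 0.25)
print(i, a[i])  # 4 2
```

Unlike a full argsort, argpartition only guarantees the element at position k, which is all that is needed here.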

Upvotes: 12

Greg Nisbet

Reputation: 6994

Assuming the array is sorted... Unless I'm misunderstanding you, you can compute the index of a percentile by taking the length of the array minus 1, multiplying it by the quantile, and rounding to the nearest integer.

round( (len(array) - 1) * (percentile / 100.) )

This should give you the index nearest to that percentile.
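Since the question concerns an unsorted array, one way to adapt this is to compute the same rounded rank and then map it back through np.argsort. This is a sketch only, and the function name percentile_index_unsorted is made up for illustration.

```python
import numpy as np

def percentile_index_unsorted(values, percentile):
    """Index in the original (unsorted) values of the nearest-rank percentile element."""
    values = np.asarray(values)
    # Rank in sorted order; note that Python's round() rounds exact halves to the even integer
    rank = int(round((len(values) - 1) * (percentile / 100.0)))
    # order[k] is the original index of the k-th smallest element
    order = np.argsort(values, kind="stable")
    return int(order[rank])

print(percentile_index_unsorted([12, 19, 11, 28, 10], 75))
```

The argsort costs O(n log n), so for large arrays a partial sort such as np.argpartition may be preferable.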

Upvotes: 1
