Senthil Kumaran

Reputation: 56881

What does numpy.percentile mean, and how can I use it to split an array?

I am trying to understand percentiles in numpy.

import numpy as np
nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])
step_intervals = range(100, 0, -5)

for percentile_interval in step_intervals:
    threshold_attr_value = np.percentile(nd_array, percentile_interval)
    print("percentile interval ={interval}, threshold_attr_value = {threshold_attr_value}, {arr}".format(interval=percentile_interval, threshold_attr_value=threshold_attr_value, arr=sorted(nd_array)))

The output looks like this:

percentile interval =100, threshold_attr_value = 4.5459, [-3.5636999999999999, -2.5419, 3.6215999999999999, 4.5458999999999996]

...

percentile interval =5, threshold_attr_value = -3.41043, [-3.5636999999999999, -2.5419, 3.6215999999999999, 4.5458999999999996]

What do these percentile values mean?

Is that the correct way to read these?

I want to split the numpy array into small sub-arrays, based on the percentile occurrences of the elements. How can I do this?

Upvotes: 3

Views: 3071

Answers (2)

fuglede

Reputation: 18201

No. As you can see by inspection, only 75% of the values in your array are strictly less than 4.5459, and 25% of the values are strictly less than -3.41043. Had you written "less than or equal to", you would have given one common definition of "percentile", but that is not what is applied in your case either. Instead, numpy applies an interpolation scheme that makes the mapping from a number in [0, 100] to the corresponding percentile continuous and piecewise linear, while still giving the "right" value at the ranks that correspond to values in the given array. Even this can be done in many different ways, all of them reasonable, as described in the Wikipedia article on the subject. As the documentation of numpy.percentile shows, you have some control over the interpolation behaviour; by default it uses what the Wikipedia article calls the "second variant, $C = 1$".
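The two fractions quoted above can be checked directly. This is a minimal sketch using the array from the question; the variable names are illustrative and it assumes the default linear interpolation.

```python
import numpy as np

nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])

# Percentiles as computed in the question (default linear interpolation).
p100 = np.percentile(nd_array, 100)
p5 = np.percentile(nd_array, 5)

# Fraction of elements strictly below each threshold.
frac_below_p100 = np.mean(nd_array < p100)  # 3 of 4 elements
frac_below_p5 = np.mean(nd_array < p5)      # 1 of 4 elements

print(p100, frac_below_p100)
print(p5, frac_below_p5)
```

So the 5th percentile here sits between the two smallest elements, and a quarter of the array lies below it, not 5%.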

Perhaps the easiest way to understand the implications of this is to plot the values of np.percentile over [0, 100] for your fixed array of length 4:

[Figure: np.percentile(nd_array, p) plotted against p in [0, 100]]

Note how the kinks are spread evenly across [0, 100], and that the actual values in your array are recovered by evaluating lambda p: np.percentile(nd_array, p) at 0*100/(4-1), 1*100/(4-1), 2*100/(4-1), and 3*100/(4-1) respectively.
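The piecewise-linear picture can also be reproduced numerically. The sketch below assumes the default linear interpolation and compares np.percentile with plain linear interpolation (np.interp) through the sorted values placed at the evenly spaced kink positions; the variable names are illustrative.

```python
import numpy as np

nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])
sorted_vals = np.sort(nd_array)
n = len(sorted_vals)

# Kinks at k * 100 / (n - 1) for k = 0 .. n-1.
kink_positions = np.linspace(0, 100, n)

# Evaluating the percentile function at the kinks returns the sorted array.
at_kinks = np.percentile(nd_array, kink_positions)

# Between the kinks, the default behaviour matches straight-line interpolation.
ps = np.linspace(0, 100, 201)
via_percentile = np.percentile(nd_array, ps)
via_interp = np.interp(ps, kink_positions, sorted_vals)

print(np.allclose(at_kinks, sorted_vals))
print(np.allclose(via_percentile, via_interp))
```

Plotting via_percentile against ps should give a curve of the kind described above, with straight segments joining the four sorted values.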

Upvotes: 2

Hossein

Reputation: 2111

To be more precise, you should say that a = np.percentile(arr, q) indicates that nearly q% of the elements of arr are lower than a. Why do I emphasize "nearly"?

  • If q=100, it always returns the maximum of arr. So, you cannot say that q% of elements are "lower than" a.
  • If q=0, it always returns the minimum of arr. So, you cannot say that q% of elements are "lower than or equal to" a.
  • In addition, the returned value depends on the type of interpolation.
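The two edge cases can be illustrated with a small check. This is a sketch on an example array of five integers, with illustrative variable names, not a general proof.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

a_max = np.percentile(arr, 100)  # equals arr.max()
a_min = np.percentile(arr, 0)    # equals arr.min()

# q=100: the maximum itself is not "lower than" a_max,
# so fewer than 100% of the elements are strictly below it.
frac_lower_than_max = np.mean(arr < a_max)

# q=0: the minimum is "lower than or equal to" a_min,
# so more than 0% of the elements satisfy that condition.
frac_at_or_below_min = np.mean(arr <= a_min)

print(a_max, frac_lower_than_max)
print(a_min, frac_at_or_below_min)
```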

The following code shows the role of the interpolation parameter (renamed to method in NumPy 1.22):

>>> import numpy as np
>>> arr = np.array([1,2,3,4,5])
>>> np.percentile(arr, 90) # default interpolation='linear'
4.5999999999999996
>>> np.percentile(arr, 90, interpolation='lower')
4
>>> np.percentile(arr, 90, interpolation='higher')
5
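For the splitting part of the question, one possible approach is to compute percentile cut points and then group the elements by the interval they fall into. This is only a sketch under assumptions: the quartile cut points (25, 50, 75) and the use of np.digitize are illustrative choices, not something taken from the question or the answers.

```python
import numpy as np

nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])

# Hypothetical choice of cut points: the quartiles.
cut_percentiles = [25, 50, 75]
edges = np.percentile(nd_array, cut_percentiles)

# bin_index[i] is the number of edges that nd_array[i] is >= to,
# i.e. the index of the interval the element falls into.
bin_index = np.digitize(nd_array, edges)

# One sub-array per interval, from lowest to highest.
sub_arrays = [nd_array[bin_index == i] for i in range(len(edges) + 1)]

for i, sub in enumerate(sub_arrays):
    print(i, sub)
```

With tied values or very short arrays the groups need not be the same size, since elements equal to an edge all land in the same interval.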

Upvotes: 2
