Reputation: 56881
I am trying to understand percentiles in numpy.
import numpy as np

nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])
step_intervals = range(100, 0, -5)
for percentile_interval in step_intervals:
    # nd_array is already an ndarray, so it can be passed directly
    threshold_attr_value = np.percentile(nd_array, percentile_interval)
    print("percentile interval ={interval}, threshold_attr_value = {threshold_attr_value}, {arr}".format(
        interval=percentile_interval, threshold_attr_value=threshold_attr_value, arr=sorted(nd_array)))
I get the following output:
percentile interval =100, threshold_attr_value = 4.5459, [-3.5636999999999999, -2.5419, 3.6215999999999999, 4.5458999999999996]
...
percentile interval =5, threshold_attr_value = -3.41043, [-3.5636999999999999, -2.5419, 3.6215999999999999, 4.5458999999999996]
What do the percentile values mean? Does this mean that 100% of the values in the array are less than 4.5459, and that 5% of the values are less than -3.41043?
Is that the correct way to read these?
I want to split the numpy array into small sub-arrays, based on the percentile occurrences of the elements. How can I do this?
Upvotes: 3
Views: 3071
Reputation: 18201
No. As you can see by inspection, only 75% of the values in your array are strictly less than 4.5459, and 25% of the values are strictly less than -3.41043. If you had written "less than or equal to", you would have been giving one common definition of "percentile", which however is also not what is applied in your case. Instead, numpy applies an interpolation scheme that ensures that the mapping taking a given number in [0, 100] to the corresponding percentile is continuous and piecewise linear, while still giving the "right" value at ranks corresponding to values in the given array. As it turns out, even this can be done in many different ways, all of which are reasonable, as described in the Wikipedia article on the subject. As you can see in the documentation of numpy.percentile, you have some control over the interpolation behaviour, and by default it uses what the Wikipedia article calls the "second variant, $C = 1$".
Perhaps the easiest way to understand the implications of this is to simply plot the values of np.percentile for your fixed length-4 array:
[Figure: np.percentile(nd_array, p) plotted against p for p in [0, 100]]
Note how the kinks are spread evenly across [0, 100], and that the percentiles corresponding to the actual values in your array are given by evaluating lambda p: np.percentile(nd_array, p) at 0*100/(4-1), 1*100/(4-1), 2*100/(4-1), and 3*100/(4-1) respectively.
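To make the kinks concrete, here is a minimal sketch that evaluates np.percentile at those four ranks and also reproduces the default linear interpolation by hand. The helper name linear_percentile is made up for illustration and is not part of numpy; it only mirrors the default behaviour for a 1-D array.

```python
import numpy as np

nd_array = np.array([3.6216, 4.5459, -3.5637, -2.5419])
sorted_values = np.sort(nd_array)
n = len(sorted_values)

# Ranks k*100/(n-1) are where the piecewise linear curve has its kinks;
# there the percentile coincides with an actual element of the array.
kink_ranks = np.arange(n) * 100.0 / (n - 1)
values_at_kinks = np.percentile(nd_array, kink_ranks)


def linear_percentile(values, q):
    """Hypothetical helper: linear interpolation between neighbouring sorted values."""
    ordered = np.sort(values)
    position = q / 100.0 * (len(ordered) - 1)  # fractional index into the sorted array
    lower = int(np.floor(position))
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


manual_5th = linear_percentile(nd_array, 5)

print(kink_ranks)       # the four ranks 0, 100/3, 200/3, 100
print(values_at_kinks)  # the sorted elements of nd_array
print(manual_5th)       # approximately -3.41043, as in the question's output
```

For q=5 the fractional index is 0.05 * 3 = 0.15, so the result lies 15% of the way from -3.5637 to -2.5419, which is where the -3.41043 in the question comes from.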
Upvotes: 2
Reputation: 2111
To be more precise, you should say that a = np.percentile(arr, q) indicates that nearly q% of the elements of arr are lower than a. Why do I emphasize "nearly"?
For q=100, it always returns the maximum of arr. So, you cannot say that q% of the elements are "lower than" a.
For q=0, it always returns the minimum of arr. So, you cannot say that q% of the elements are "lower than or equal to" a.
The returned value also depends on the interpolation method. The following code shows the role of the interpolation parameter:
>>> import numpy as np
>>> arr = np.array([1,2,3,4,5])
>>> np.percentile(arr, 90) # default interpolation='linear'
4.5999999999999996
>>> np.percentile(arr, 90, interpolation='lower')
4
>>> np.percentile(arr, 90, interpolation='higher')
5
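Note that in NumPy 1.22 and later the interpolation keyword was renamed to method. The following is a rough sketch of a wrapper that tries the newer keyword first and falls back to the older one; percentile_with_method is a hypothetical helper written for this example, not a numpy function.

```python
import numpy as np


def percentile_with_method(values, q, method):
    """Hypothetical helper: call np.percentile with whichever keyword this numpy accepts."""
    try:
        # NumPy >= 1.22 spells the keyword `method`
        return np.percentile(values, q, method=method)
    except TypeError:
        # Older NumPy versions only know `interpolation`
        return np.percentile(values, q, interpolation=method)


arr = np.array([1, 2, 3, 4, 5])
results = {
    name: percentile_with_method(arr, 90, name)
    for name in ("linear", "lower", "higher")
}
print(results)  # linear is about 4.6, lower is 4, higher is 5
```

The three values match the session above: the 90th percentile sits at fractional index 3.6, so 'lower' picks the element at index 3, 'higher' the element at index 4, and 'linear' interpolates between them.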
Upvotes: 2