Reputation: 11678
This code:
print len(my_series)
print np.percentile(my_series, 98)
print np.percentile(my_series, 99)
gives:
14221 # This is the series length
1644.2 # 98th percentile
nan # 99th percentile?
Why does 98 work fine but 99 give nan?
Upvotes: 11
Views: 18406
Reputation: 2279
You could be facing overflows during computation, which may explain why you're seeing NaNs at high percentiles. In my case, I also encountered NaNs in my code. To address this, you can use np.nanpercentile, which handles NaN values more robustly.
import numpy as np
# Example with NaNs in the data
data = np.array([1, 2, np.nan, 4, 5, 6])
# Calculate the 95th percentile, ignoring NaNs
percentile_95 = np.nanpercentile(data, 95)
print(f"95th percentile (ignoring NaNs): {percentile_95}")
This method ensures that any NaN values in your dataset are excluded from the percentile calculation. If you're encountering NaNs due to overflows or other computational issues, switching to np.nanpercentile should help.
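If you want to confirm that NaNs are actually present before reaching for np.nanpercentile, a quick check is to count them (a minimal sketch, reusing the data array from the example above):
n_nan = np.isnan(data).sum()  # number of NaN entries
print(f"NaN count: {n_nan} ({n_nan / data.size:.1%} of the data)")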
For more information, see this GitHub comment. Thanks to @Nick Chammas for highlighting this in the comments.
Upvotes: 0
Reputation: 2696
np.percentile treats NaNs as very high numbers, so the high percentiles end up in the range occupied by the NaNs. In your case, between 1% and 2% of your data must be NaNs: the 98th percentile returns a number (which is not actually the 98th percentile of all the valid values), while the 99th returns nan.
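You can see this "NaNs count as largest" ordering directly with np.sort, which always places NaNs at the end of the sorted array (a small illustration, not part of the original answer):
import numpy as np

a = np.array([5.0, np.nan, 1.0, 3.0])
print(np.sort(a))  # [ 1.  3.  5. nan] -- NaNs sort past every finite value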
To calculate the percentile without the NaNs, you can use np.nanpercentile(). So:
print(np.nanpercentile(my_series, 98))
print(np.nanpercentile(my_series, 99))
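Equivalently, you can drop the NaNs yourself and call plain np.percentile on the cleaned data; a sketch assuming my_series is a NumPy array (for a pandas Series, my_series.dropna().values would give the same result):
valid = my_series[~np.isnan(my_series)]  # keep only the non-NaN values
print(np.percentile(valid, 98))
print(np.percentile(valid, 99))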
Edit:
In newer NumPy versions, np.percentile returns nan whenever NaNs are present, making this problem directly apparent. np.nanpercentile still works the same.
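To see both behaviors side by side (a minimal sketch with made-up data):
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])
print(np.percentile(data, 50))     # nan -- a single NaN poisons the result
print(np.nanpercentile(data, 50))  # 2.0 -- NaNs are excluded before computing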
Upvotes: 20