Thomas Johnson

Reputation: 11678

Why does np.percentile return NaN for high percentiles?

This code:

print(len(my_series))
print(np.percentile(my_series, 98))
print(np.percentile(my_series, 99))

gives:

14221  # This is the series length
1644.2  # 98th percentile
nan  # 99th percentile?

Why does 98 work fine but 99 give nan?

Upvotes: 11

Views: 18406

Answers (2)

Marine Galantin

Reputation: 2279

You could be facing overflows during computation, which may explain the NaNs at high percentiles. I also ran into NaNs in my own code. To work around this, you can use np.nanpercentile, which ignores NaN values when computing the percentile.

import numpy as np

# Example with NaNs in the data
data = np.array([1, 2, np.nan, 4, 5, 6])

# Calculate the 95th percentile, ignoring NaNs
percentile_95 = np.nanpercentile(data, 95)

print(f"95th percentile (ignoring NaNs): {percentile_95}")

This method ensures that any NaN values in your dataset are excluded from the percentile calculation. If you're encountering NaNs due to overflows or other computational issues, switching to np.nanpercentile should help.

For more information: GitHub Comment. Thanks to @Nick Chammas for pointing this out in a comment.

Upvotes: 0

Niels Henkens

Reputation: 2696

np.percentile treats NaNs as very large numbers, so they land at the top of the sorted data and the high percentiles fall into the range occupied by NaNs. In your case, between 1 and 2 percent of your data must be NaN: the 98th percentile returns a number (which is not actually the 98th percentile of the valid values), while the 99th returns nan.
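A quick way to check this on your own data is to measure the share of NaNs. The sketch below is only an illustration: values is a made-up array standing in for my_series, with 1.5% of its entries set to NaN. It also shows that np.sort places NaNs at the end, which is why they occupy the top percentiles.

```python
import numpy as np

# Hypothetical stand-in for my_series: 1000 values, 15 of them NaN (1.5%)
rng = np.random.default_rng(0)
values = rng.normal(size=1000)
values[:15] = np.nan

# Fraction of entries that are NaN
nan_fraction = np.isnan(values).mean()
print(nan_fraction)

# np.sort puts NaNs last, so the final 15 sorted entries are all NaN
tail_is_nan = bool(np.isnan(np.sort(values)[-15:]).all())
print(tail_is_nan)
```

If the NaN fraction is between 1% and 2%, that would match the symptom in the question.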

To calculate the percentile without the NaNs, you can use np.nanpercentile().

So:

print(np.nanpercentile(my_series, 98))
print(np.nanpercentile(my_series, 99))

Edit: In newer NumPy versions, np.percentile returns nan whenever NaNs are present, which makes the problem immediately apparent. np.nanpercentile still works the same way.
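As a small illustration of the two functions side by side, here is a sketch on a toy array (data is made up for the example, not taken from the question). The np.percentile result assumes a recent NumPy release, as described in the edit above; older releases may behave differently.

```python
import numpy as np

# Toy data with a single NaN (hypothetical, for illustration only)
data = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# With a NaN present, recent NumPy versions return nan here
with_nan = np.percentile(data, 50)
print(with_nan)

# nanpercentile drops the NaN and works on [1, 2, 4, 5]
without_nan = np.nanpercentile(data, 50)
print(without_nan)  # 3.0
```

An equivalent approach is to filter first, for example np.percentile(data[~np.isnan(data)], 50), which gives the same result as np.nanpercentile.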

Upvotes: 20
