Reputation: 9
I am trying to calculate the mean within each bin. Everything works, but I get a 'nan' mean value in the first bin, which I suppose is not correct. Can you help me find the mistake?
Here is my code:
import numpy as np
data = np.array([-90,-1,2,3,5,6,8,10,121])
bin_s = np.array([-np.inf, 1, 3, 5, 8, 9, +np.inf])
dig = np.digitize(data, bin_s)
sol = np.bincount(dig, data) / np.bincount(dig)
sol
Upvotes: 0
Views: 2224
Reputation: 13087
NumPy's bincount returns the population of each bin. If a bin is empty, its count is zero, so the division by np.bincount(dig) evaluates 0/0 for that bin, which yields nan (with a RuntimeWarning) instead of a mean. A quick fix would be
sol = np.bincount(dig, data) / np.array([max(1, v) for v in np.bincount(dig)])
i.e., to divide by 1 instead of 0 for such bins, since in this case we know that the bin is empty and thus the corresponding value in np.bincount(dig, data) is also zero (however, this would depend on how you want to interpret the mean of an empty bin). This will give:
[ 0. -45.5 2. 3. 5.5 8. 65.5]
The first element here is not phony: it corresponds to bin index 0, which would aggregate data smaller than min(bin_s). Since that value is -np.inf in your case, there are no such data. However, intermediate bins can also turn out to be empty. For example, if you take as the input data:
data = np.array([-90,-1,2,3,10,121])
Then np.bincount(dig) returns [0 2 1 1 0 0 2], so one needs to handle the other zeros as well, not only disregard the first element.
Also, you might consider binned_statistic from scipy.stats, which does this directly:
import numpy as np
from scipy.stats import binned_statistic as bstat
data = np.array([-90,-1,2,3,5,6,8,10,121])
stat = bstat(data, data, statistic='mean', bins=[-np.inf, 1, 3, 5, 8, 9, +np.inf])
# stat[0] (also available as stat.statistic) holds the per-bin means
print(stat[0])
Upvotes: 2
Reputation: 57033
The first bin is phony, disregard it:
np.bincount(dig, data)[1:] / np.bincount(dig)[1:]
#array([-45.5, 2. , 3. , 5.5, 8. , 65.5])
Upvotes: 0