vikkky

Reputation: 9

Mean calculation within bins

I am trying to calculate the mean within each bin. Everything else works, but I get a 'nan' mean value for the first bin, which I suppose is not correct. Can you help me find the mistake?

Here is my code:

import numpy as np

data = np.array([-90, -1, 2, 3, 5, 6, 8, 10, 121])

bin_s = np.array([-np.inf, 1, 3, 5, 8, 9, +np.inf])

# bin index of each sample
dig = np.digitize(data, bin_s)
# per-bin sum divided by per-bin count
sol = np.bincount(dig, data) / np.bincount(dig)
sol


Upvotes: 0

Views: 2224

Answers (2)

ewcz

Reputation: 13087

NumPy's bincount returns the populations of the individual bins. If a bin is empty, its count is zero, so dividing by np.bincount(dig) computes 0/0 for that bin and yields nan (with a RuntimeWarning) rather than a mean. A quick fix would be

 sol = np.bincount(dig, data) / np.array([max(1, v) for v in np.bincount(dig)])

i.e., divide by 1 instead of 0 for such bins. An empty bin also has a zero sum in np.bincount(dig, data), so its mean comes out as 0 (whether that is the right value depends on how you want to interpret the mean of an empty bin). This gives:

[  0.  -45.5   2.    3.    5.5   8.   65.5]
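The list comprehension can also be replaced by a vectorized division. The following is only a minimal sketch: the names counts, sums and means are illustrative, and it assumes you would rather report nan than 0 for an empty bin.

```python
import numpy as np

data = np.array([-90, -1, 2, 3, 5, 6, 8, 10, 121])
bin_s = np.array([-np.inf, 1, 3, 5, 8, 9, np.inf])

dig = np.digitize(data, bin_s)
counts = np.bincount(dig)                # number of samples per bin index
sums = np.bincount(dig, weights=data)    # sum of samples per bin index

# Divide only where a bin is populated; empty bins keep the nan fill value
# (assumption: nan is the desired marker for "no data in this bin").
means = np.full(counts.shape, np.nan)
np.divide(sums, counts, out=means, where=counts > 0)
print(means)
```

Because the division is skipped wherever counts is zero, no 0/0 is ever evaluated. Swap np.nan for 0.0 in np.full to reproduce the output shown above.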

The first element here is not phony: it corresponds to bin index zero, which would aggregate data smaller than min(bin_s). In your case that minimum is -np.inf, so no data can fall there. However, intermediate bins can turn out to be empty as well. For example, if you take as the input data:

data = np.array([-90,-1,2,3,10,121])

Then np.bincount(dig) returns [0 2 1 1 0 0 2], so one needs to handle the other zeros as well, not only disregard the first element.
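If you would rather skip empty bins than assign them a value, one option is to compute means only for the populated bin indices. This is a sketch under that assumption; populated and means are names chosen here for illustration.

```python
import numpy as np

data = np.array([-90, -1, 2, 3, 10, 121])
bin_s = np.array([-np.inf, 1, 3, 5, 8, 9, np.inf])

dig = np.digitize(data, bin_s)
counts = np.bincount(dig)
sums = np.bincount(dig, weights=data)

# Indices of bins that actually contain data; empty bins are dropped,
# so no division by zero can occur.
populated = np.flatnonzero(counts)
means = sums[populated] / counts[populated]

for idx, m in zip(populated, means):
    print(f"bin {idx}: mean = {m}")
```

Keeping populated alongside means preserves the link between each mean and its bin index, which is lost if you simply filter out the nan values afterwards.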

Also, you might consider binned_statistic provided by scipy which does this directly:

import numpy as np
from scipy.stats import binned_statistic as bstat

data = np.array([-90,-1,2,3,5,6,8,10,121])

stat = bstat(data, data, statistic = 'mean', bins = [-np.inf, 1, 3, 5, 8, 9, +np.inf])
print(stat[0])

Upvotes: 2

DYZ

Reputation: 57033

The first bin is phony (index 0 is reserved for values below the first edge, and nothing is below -np.inf), so disregard it:

np.bincount(dig, data)[1:] / np.bincount(dig)[1:]
#array([-45.5,   2. ,   3. ,   5.5,   8. ,  65.5])
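To see why index 0 stays empty here, it may help to look at what np.digitize returns at the edges. The sketch below uses hypothetical finite edges and values (not the data from the question) and relies on the default right=False behavior.

```python
import numpy as np

edges = np.array([1, 3, 5])              # hypothetical finite bin edges
values = np.array([0, 1, 2, 4, 5, 7])    # hypothetical samples

# Index 0 is used only for values below the first edge;
# index len(edges) is used for values at or above the last edge.
idx_finite = np.digitize(values, edges)
print(idx_finite)

# With -np.inf as the first edge, no finite value can land in index 0.
idx_open = np.digitize(values, np.array([-np.inf, 1, 3, 5]))
print(idx_open.min())
```

Slicing with [1:] is therefore safe only when the first edge is -np.inf; with a finite first edge, index 0 can hold real data.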

Upvotes: 0
