Reputation: 63

Operating on histogram bins Python

I am trying to find the median of the values within each bin range generated by the np.histogram function. How would I select only the values within a bin range and operate on those specific values? Below is an example of my data and what I am trying to do:

x = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

y values can have any sort of x value associated with them, for example:

hist, bins = np.histogram(x)
hist = [129, 126, 94, 133, 179, 206, 142, 147, 90, 185] 
bins = [0.,         0.09999926, 0.19999853, 0.29999779, 0.39999706,
        0.49999632, 0.59999559, 0.69999485, 0.79999412, 0.8999933,
        0.99999265]

So, I am trying to find the median y value of the 129 values in the first bin generated, etc.

Upvotes: 3

Answers (3)

Mad Physicist

Reputation: 114320

Both np.digitize and np.searchsorted will match each value in your data to a bin. The latter is preferable in this situation because it skips the monotonicity check that np.digitize performs on the bins (the edges returned by np.histogram can safely be assumed to be sorted).

If you look at the documentation of np.histogram (Notes section), you will notice that the bins are all half-open on the right (except the last one). This means that you can do the following:

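import numpy as np

x = np.abs(np.random.normal(loc=0.75, scale=0.75, size=10000))
h, b = np.histogram(x)
# side='right' gives 1-based labels; shift to 0-based and clip so x.max() stays in the last bin
ind = np.clip(np.searchsorted(b, x, side='right') - 1, 0, h.size - 1)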

Now ind contains a label for each number indicating which bin it belongs to. You can compute medians:

m = [np.median(x[ind == label]) for label in range(b.size - 1)]
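Since the question asks about y values paired with the binned x values, the same labels can be used to index a second array. A minimal sketch, assuming y is an array of the same length as x (the y data here is made up for illustration) and that the labels are 0-based with the maximum folded into the last bin:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)            # values that define the bins
y = rng.normal(size=x.size)     # hypothetical paired values to summarise

h, b = np.histogram(x)
# 0-based bin label for every x; clip so x.max() falls in the last (closed) bin
ind = np.clip(np.searchsorted(b, x, side='right') - 1, 0, h.size - 1)

# median of the y values whose x falls in each bin
y_medians = [np.median(y[ind == label]) for label in range(h.size)]
```

A quick sanity check is that np.bincount(ind, minlength=h.size) reproduces the counts in h.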

If you are able to sort the input data, your job becomes easier because you can use views instead of extracting the data for each bin using masking. np.split is a good choice in this case:

x.sort()
sections = np.split(x, np.cumsum(h[:-1]))
m = [np.median(arr) for arr in sections]
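If the values to summarise live in a separate y array, the split approach should still work as long as y is reordered by the sort order of x. This is only a sketch: the y data is hypothetical, and np.argsort is used here in place of the in-place x.sort() so the pairing between x and y is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.normal(size=x.size)    # hypothetical paired values

h, b = np.histogram(x)

order = np.argsort(x)          # permutation that sorts x
y_by_x = y[order]              # y reordered to follow the sorted x
# the cumulative counts mark where each bin ends in the sorted data
sections = np.split(y_by_x, np.cumsum(h[:-1]))
y_medians = [np.median(arr) for arr in sections]
```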

Upvotes: 0

tel

Reputation: 13999

You can do this by slicing a sorted version of your data using the counts as indices:

import numpy as np

x = np.random.rand(1000)
hist, bins = np.histogram(x)

# if you don't mind sorting your original data in place, use x.sort() instead
xsorted = np.sort(x)
# cumulative counts give the slice boundaries of each bin in the sorted data
ix = [0] + hist.cumsum().tolist()
[np.median(xsorted[i:j]) for i, j in zip(ix[:-1], ix[1:])]

which will output the medians as a standard Python list.
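One thing to watch for with real data is empty bins: np.median on an empty slice returns NaN and emits a RuntimeWarning. A possible way to handle that, sketched here with a small made-up array and an explicit NaN for empty bins (the choice of NaN as the placeholder is an assumption, not part of the approach above):

```python
import numpy as np

# made-up data with a gap: nothing falls in the three middle bins
x = np.array([0.0, 0.05, 0.1, 0.95, 1.0])
hist, bins = np.histogram(x, bins=5)

xsorted = np.sort(x)
ix = [0] + hist.cumsum().tolist()

# use NaN for empty bins instead of calling np.median on an empty slice
medians = [
    float(np.median(xsorted[i:j])) if j > i else float('nan')
    for i, j in zip(ix[:-1], ix[1:])
]
```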

Upvotes: 0

Brad Solomon

Reputation: 40878

One way is with pandas.cut():

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)

>>> x = np.random.randint(0, 25, size=100)
>>> _, bins = np.histogram(x)
>>> pd.Series(x).groupby(pd.cut(x, bins)).median()
(0.0, 2.4]       2.0
(2.4, 4.8]       3.0
(4.8, 7.2]       6.0
(7.2, 9.6]       8.5
(9.6, 12.0]     10.5
(12.0, 14.4]    13.0
(14.4, 16.8]    15.5
(16.8, 19.2]    18.0
(19.2, 21.6]    20.5
(21.6, 24.0]    23.0
dtype: float64

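Note that pd.cut() builds right-closed intervals by default, so values equal to the lowest edge are left out and values on an interior edge land in the lower bin, which differs from np.histogram. If you want to stay in NumPy, you might want to check out np.digitize().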
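A rough NumPy-only sketch using np.digitize, reusing the seed and data from the pandas example. The label adjustment is my own addition: np.digitize returns 1-based labels for left-closed bins, and the maximum value ends up one past the last bin, so it is folded back in to match np.histogram.

```python
import numpy as np

np.random.seed(444)
x = np.random.randint(0, 25, size=100)
counts, bins = np.histogram(x)

# shift the 1-based labels to 0-based
labels = np.digitize(x, bins) - 1
# x.max() sits on the last edge and gets label counts.size; move it into the last bin
labels[labels == counts.size] = counts.size - 1

medians = [np.median(x[labels == k]) for k in range(counts.size)]
```

Because these bins are left-closed like np.histogram's, the medians can differ from the pd.cut() result for values that sit exactly on a bin edge.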

Upvotes: 2
