Reputation: 63
I am trying to find the median of the values within each bin generated by the np.histogram function. How would I select only the values within a given bin range and operate on those specific values? Below is an example of my data and what I am trying to do:
x = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
y values can have any sort of x value associated with them, for example:
hist, bins = np.histogram(x)
hist = [129, 126, 94, 133, 179, 206, 142, 147, 90, 185]
bins = [0., 0.09999926, 0.19999853, 0.29999779, 0.39999706,
0.49999632, 0.59999559, 0.69999485, 0.79999412, 0.8999933,
0.99999265]
So, I am trying to find the median y value of the 129 values in the first bin generated, etc.
Upvotes: 3
Views: 1038
Reputation: 114320
np.digitize and np.searchsorted will match your data with bins. The latter is preferable in this situation because it does fewer unnecessary checks (your bins can safely be assumed to be sorted).
If you look at the documentation of np.histogram (Notes section), you will notice that the bins are all half-open on the right (except the last one, which is closed). This means that you can do the following:
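import numpy as np
x = np.abs(np.random.normal(loc=0.75, scale=0.75, size=10000))
h, b = np.histogram(x)
# side='right' matches the half-open bins; subtract 1 for zero-based labels,
# and clip so that x == b[-1] lands in the last (closed) bin
ind = np.clip(np.searchsorted(b, x, side='right') - 1, 0, b.size - 2)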
Now ind contains a label for each number indicating which bin it belongs to. You can compute medians:
m = [np.median(x[ind == label]) for label in range(b.size - 1)]
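The question asks about y values grouped by the bins of x, so the same labels can be used to index a second array. The following is a minimal sketch under stated assumptions: the y array is hypothetical (it is not part of the original data), and the labels are shifted to be zero-based and clipped so that the maximum of x falls in the last bin. Comparing np.bincount with the histogram counts is one way to check that the labelling agrees with np.histogram.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.normal(size=x.size)  # hypothetical values paired with x

h, b = np.histogram(x)

# Zero-based bin label for every x; the clip keeps x == b[-1] in the last bin.
ind = np.clip(np.searchsorted(b, x, side='right') - 1, 0, b.size - 2)

# Sanity check: the number of points per label should equal the histogram counts.
counts = np.bincount(ind, minlength=h.size)

# Median of the y values whose x falls in each bin.
y_medians = [np.median(y[ind == label]) for label in range(h.size)]
```

An empty bin would produce nan (with a NumPy warning) in y_medians, so real data may need a guard for that case.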
If you are able to sort the input data, your job becomes easier because you can use views instead of extracting the data for each bin using masking. np.split is a good choice in this case:
x.sort()
sections = np.split(x, np.cumsum(h[:-1]))
m = [np.median(arr) for arr in sections]
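When the values to summarise are in a separate array, the sorting idea can be carried over with np.argsort. This is only a sketch: y is a hypothetical array paired element-wise with x, and the approach relies on the fact that, once x is sorted, the members of each bin are contiguous.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(500)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)  # hypothetical paired values

h, b = np.histogram(x)

# Reorder y so that it follows ascending x, then cut it at the cumulative counts.
order = np.argsort(x)
sections = np.split(y[order], np.cumsum(h)[:-1])

y_medians = [np.median(arr) for arr in sections]
```

Unlike x.sort(), this leaves both original arrays untouched, at the cost of one reordered copy of y.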
Upvotes: 0
Reputation: 13999
You can do this by slicing a sorted version of your data using the counts as indices:
x = np.random.rand(1000)
hist, bins = np.histogram(x)
# if you don't mind sorting your original data, use x.sort() instead
xsorted = np.sort(x)
# slice boundaries: cumulative counts with a leading 0
ix = [0] + hist.cumsum().tolist()
[np.median(xsorted[i:j]) for i, j in zip(ix[:-1], ix[1:])]
which will output the medians as a standard Python list.
Upvotes: 0
Reputation: 40878
One way is with pandas.cut():
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)
>>> x = np.random.randint(0, 25, size=100)
>>> _, bins = np.histogram(x)
>>> pd.Series(x).groupby(pd.cut(x, bins)).median()
(0.0, 2.4] 2.0
(2.4, 4.8] 3.0
(4.8, 7.2] 6.0
(7.2, 9.6] 8.5
(9.6, 12.0] 10.5
(12.0, 14.4] 13.0
(14.4, 16.8] 15.5
(16.8, 19.2] 18.0
(19.2, 21.6] 20.5
(21.6, 24.0] 23.0
dtype: float64
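Note that pd.cut() builds right-closed intervals by default, so a value equal to the lowest edge is not assigned to any bin unless you pass include_lowest=True. If you want to stay in NumPy, you might want to check out np.digitize().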
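A NumPy-only version along those lines might look like the sketch below. It is not taken from the answer above: the data is generated with a different random generator, so the values will not match the pandas output, and passing only the interior edges to np.digitize is one possible choice that keeps the maximum value in the last bin, as np.histogram does.

```python
import numpy as np

rng = np.random.default_rng(444)
x = rng.integers(0, 25, size=100)

hist, bins = np.histogram(x)

# With only the interior edges, np.digitize returns labels 0 .. len(hist) - 1:
# values below bins[1] get 0, values at or above bins[-2] get the last label.
labels = np.digitize(x, bins[1:-1])

# Median of the values in each bin (nan for a bin that happens to be empty).
medians = np.array([np.median(x[labels == k]) for k in range(hist.size)])
```

Here the bins are half-open on the right, matching np.histogram rather than the right-closed intervals that pd.cut() uses by default.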
Upvotes: 2