Reputation: 1967
I think I'm misunderstanding NumPy's histogram2d range and bins arguments.
Here's an example of it working how I'd expect:
d, x_r, y_r = np.histogram2d(
[0, 1, 3],
[0, 1, 3],
bins=[3, 3],
range=[[0, 3], [0, 3]],
normed=True)
d
array([[ 0.33333333, 0. , 0. ],
[ 0. , 0.33333333, 0. ],
[ 0. , 0. , 0.33333333]])
np.sum(d)
1.0
And here's where things start to fall apart for me (increasing bin count):
d, x_r, y_r = np.histogram2d(
[0, 1, 3],
[0, 1, 3],
bins=[3, 6],
range=[[0, 3], [0, 3]],
normed=True)
d
array([[ 0.66666667, 0., 0. , 0., 0., 0. ],
[ 0. , 0., 0.66666667, 0., 0., 0. ],
[ 0. , 0., 0. , 0., 0., 0.66666667]])
np.sum(d)
2.0
I would have expected:
d
array([[ 0.33333333, 0., 0. , 0., 0., 0. ],
[ 0. , 0., 0.33333333, 0., 0., 0. ],
[ 0. , 0., 0. , 0., 0., 0.33333333]])
Would appreciate any help on understanding this and getting the result I'm looking for. Thanks.
Upvotes: 1
Views: 288
Reputation: 5382
The normed arg in np.histogram2d normalizes as follows
bin_count / sample_count / bin_area
These take a while to understand, and the source code is not written very well in my opinion (poorly chosen variable names)
bin_count is the value in the histogram bin
sample_count is the total sum of all bin_counts
bin_area is the area of the particular bin
We can define the above 3 variables in both cases without using the normed arg, and see what's going on:
bin_count, binsx, binsy = np.histogram2d( [0,1,3], [0,1,3],
bins=[3,3], range=[[0,3],[0,3]], normed=False)
If you look at binsx and binsy you will see the area of each bin is 1:
print(binsx, binsy)
#(array([ 0., 1., 2., 3.]), array([ 0., 1., 2., 3.]))
Therefore we let bin_area=1 and the normalized 2D histogram looks like
bin_count / bin_count.sum() / bin_area
#array([[ 0.33333333, 0. , 0. ],
#[ 0. , 0.33333333, 0. ],
#[ 0. , 0. , 0.33333333]])
Now repeat this for your second case, with bins=[3,6]:
bin_count, binsx, binsy = np.histogram2d( [0,1,3], [0,1,3],
bins=[3,6], range=[[0,3],[0,3]], normed=False)
print(binsx, binsy)
#(array([ 0., 1., 2., 3.]), array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. ]))
Now you can see your bin_area has decreased by a factor of 2 (because you increased the number of y-bins by a factor of 2).
Hence, we let bin_area=.5, and the normalized hist looks like
bin_count / bin_count.sum() / bin_area
#array([[ 0.66666667, 0. , 0. , 0. , 0. , 0. ],
# [ 0. , 0. , 0.66666667, 0. , 0. , 0. ],
# [ 0. , 0. , 0. , 0. , 0. , 0.66666667]])
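If what you want is the fraction of samples per bin (values that sum to 1, as in your expected output), one option is to skip the normalization keyword and divide the raw counts by their total yourself. This is only a sketch of that idea; the names counts, fractions and density are mine, not NumPy's:

```python
import numpy as np

# Raw counts for the 3x6 case from the question (no normalization keyword).
counts, xedges, yedges = np.histogram2d(
    [0, 1, 3], [0, 1, 3],
    bins=[3, 6], range=[[0, 3], [0, 3]])

# Fraction of samples per bin: sums to 1 regardless of bin size.
fractions = counts / counts.sum()

# Density per unit area: every bin here is 1 wide in x and 0.5 wide in y.
bin_area = 1.0 * 0.5
density = fractions / bin_area

print(fractions.sum())             # total probability mass
print((density * bin_area).sum())  # density integrated over all bins
```

The density values are the 0.666... numbers shown above. They sum to 2, but multiplied by the bin area they integrate to 1, which is the property the normalization is designed to preserve.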
In general you can have bins of variable sizes, hence bin_area can differ from bin to bin. Consider some non-even bins:
bin_count, binsx, binsy = np.histogram2d( [0,1,3], [0,1,3],
bins=([0.,1.5,3.],[0, .6, 1.7,3.]),
range=[[0,3],[0,3]], normed=False)
In this case, calculate the area of each bin explicitly:
bin_area = np.array( [ [(x1 -x0)* (y1-y0)
for y1,y0 in zip(binsy[1:], binsy[:-1])]
for x1,x0 in zip(binsx[1:], binsx[:-1]) ] )
print(bin_area)
#array([[ 0.9 , 1.65, 1.95],
# [ 0.9 , 1.65, 1.95]])
bin_count / bin_count.sum() / bin_area
#array([[ 0.37037037, 0.2020202 , 0. ],
# [ 0. , 0. , 0.17094017]])
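As a side note, the nested list comprehension for bin_area can also be written with np.diff and np.outer. This is just an equivalent sketch for the same non-even edges; the variable names (xedges, yedges, counts) are my own choice:

```python
import numpy as np

# Same non-even bin edges as above.
xedges = np.array([0.0, 1.5, 3.0])
yedges = np.array([0.0, 0.6, 1.7, 3.0])

# Raw counts; explicit edges are passed through the bins argument.
counts, xedges, yedges = np.histogram2d(
    [0, 1, 3], [0, 1, 3], bins=(xedges, yedges))

# np.diff gives the bin widths along each axis; the outer product of the
# two width vectors is the area of bin (i, j).
bin_area = np.outer(np.diff(xedges), np.diff(yedges))

# Same normalization as before: bin_count / sample_count / bin_area.
density = counts / counts.sum() / bin_area

print(bin_area)
print(density)
```

This should print the same bin_area and normalized values as the explicit loop version.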
Indeed, if we set the normed arg to True, we get the same result:
normed_bin_count, binsx, binsy = np.histogram2d( [0,1,3], [0,1,3],
bins=([0.,1.5,3.],[0, .6, 1.7,3.]),
range=[[0,3],[0,3]], normed=True)
print(normed_bin_count)
#array([[ 0.37037037, 0.2020202 , 0. ],
# [ 0. , 0. , 0.17094017]])
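One caveat: as far as I know, recent NumPy releases have dropped the normed keyword in favour of density, so on a current install the call above may need density=True instead. The sketch below assumes a NumPy version where histogram2d accepts density, and checks it against the manual formula:

```python
import numpy as np

x = [0, 1, 3]
y = [0, 1, 3]
xedges = [0.0, 1.5, 3.0]
yedges = [0.0, 0.6, 1.7, 3.0]

# Raw counts and the edges NumPy actually used.
counts, bx, by = np.histogram2d(x, y, bins=(xedges, yedges))

# Built-in normalization (assumes the density keyword is available).
dens, _, _ = np.histogram2d(x, y, bins=(xedges, yedges), density=True)

# Manual normalization: bin_count / sample_count / bin_area.
bin_area = np.outer(np.diff(bx), np.diff(by))
manual = counts / counts.sum() / bin_area

print(np.allclose(dens, manual))
print((dens * bin_area).sum())
```

The second print is the density integrated over all bins, which is what the normalization keeps equal to 1 no matter how the bins are chosen.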
Upvotes: 1