freebie

Reputation: 1967

Numpy 2d histogram not summing to 1

I think I'm misunderstanding Numpy's histogram2d range and bin arguments.

Here's an example of it working how I'd expect:

d, x_r, y_r = np.histogram2d(
    [0, 1, 3], 
    [0, 1, 3], 
    bins=[3, 3], 
    range=[[0, 3], [0, 3]], 
    normed=True)

d
array([[ 0.33333333,  0.        ,  0.        ],
       [ 0.        ,  0.33333333,  0.        ],
       [ 0.        ,  0.        ,  0.33333333]])
np.sum(d)
1.0

And here's where things start to fall apart for me (increasing bin count):

d, x_r, y_r = np.histogram2d(
    [0, 1, 3], 
    [0, 1, 3], 
    bins=[3, 6], 
    range=[[0, 3], [0, 3]], 
    normed=True)
d
array([[ 0.66666667,  0.,  0.        ,  0.,  0., 0.        ],
       [ 0.        ,  0.,  0.66666667,  0.,  0., 0.        ],
       [ 0.        ,  0.,  0.        ,  0.,  0., 0.66666667]])
np.sum(d)
2.0

I would have expected:

d
array([[ 0.33333333,  0.,  0.        ,  0.,  0., 0.        ],
       [ 0.        ,  0.,  0.33333333,  0.,  0., 0.        ],
       [ 0.        ,  0.,  0.        ,  0.,  0., 0.33333333]])

Would appreciate any help on understanding this and getting the result I'm looking for. Thanks.

Upvotes: 1

Views: 288

Answers (1)

dermen

Reputation: 5382

The normed arg in np.histogram2d normalizes each bin as follows:

bin_count / sample_count / bin_area

The three terms take a while to understand, and in my opinion the source code is not very readable (poorly chosen variable names):

  • bin_count is the value in the histogram bin
  • sample_count is the sum of all the bin_counts
  • bin_area is the area of that particular bin
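As a rough sketch, the formula can be written as a small helper and checked against NumPy's own normalization. The name density_from_counts is my own (hypothetical), and the check uses density=True, which I believe is the replacement for normed in newer NumPy releases:

```python
import numpy as np

def density_from_counts(counts, xedges, yedges):
    """Hypothetical helper: bin_count / sample_count / bin_area."""
    # Area of every bin: outer product of the x-widths and y-widths.
    bin_area = np.outer(np.diff(xedges), np.diff(yedges))
    return counts / counts.sum() / bin_area

x = [0, 1, 3]
y = [0, 1, 3]

# Raw counts, no normalization.
counts, xedges, yedges = np.histogram2d(
    x, y, bins=[3, 6], range=[[0, 3], [0, 3]])
manual = density_from_counts(counts, xedges, yedges)

# Assumption: density=True is available (newer NumPy, in place of normed).
builtin, _, _ = np.histogram2d(
    x, y, bins=[3, 6], range=[[0, 3], [0, 3]], density=True)

print(np.allclose(manual, builtin))
```

If the two arrays agree, the helper reproduces what the normalization flag does for this input.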

We can compute these three quantities in both cases without using the normed arg, and see what's going on:

Case 1

bin_count, binsx, binsy = np.histogram2d( [0,1,3], [0,1,3], 
    bins=[3,3], range=[[0,3],[0,3]], normed=False)

If you look at binsx and binsy you will see that every bin is 1 wide in both x and y, so the area of each bin is 1:

print(binsx, binsy)
#(array([ 0.,  1.,  2.,  3.]), array([ 0.,  1.,  2.,  3.]))

Therefore we let bin_area = 1, and the normalized 2D histogram is:

bin_count / bin_count.sum() / bin_area

#array([[ 0.33333333,  0.        ,  0.        ],
#       [ 0.        ,  0.33333333,  0.        ],
#       [ 0.        ,  0.        ,  0.33333333]])

Case 2

bin_count, binsx, binsy = np.histogram2d( [0,1,3], [0,1,3], 
    bins=[3,6], range=[[0,3],[0,3]], normed=False)
print(binsx, binsy)
#(array([ 0.,  1.,  2.,  3.]), array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ]))

Now you can see that bin_area has decreased by a factor of 2: doubling the number of y-bins halved each bin's height from 1 to 0.5, while the x-width stayed at 1.

Hence we let bin_area = 0.5, and the normalized histogram is:

bin_count / bin_count.sum() / bin_area

#array([[ 0.66666667,  0.        ,  0.        ,  0.        ,  0.        ,
#         0.        ],
#       [ 0.        ,  0.        ,  0.66666667,  0.        ,  0.        ,
#         0.        ],
#       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
#         0.66666667]])
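This is also why the values sum to 2 rather than 1: the normalized histogram is a density, so it is the density times the bin area that sums to 1, not the density itself. A minimal sketch of that check, assuming a NumPy version that accepts density=True (which I believe replaces normed in newer releases):

```python
import numpy as np

x = [0, 1, 3]
y = [0, 1, 3]

# Assumption: density=True is available in the installed NumPy.
density, xedges, yedges = np.histogram2d(
    x, y, bins=[3, 6], range=[[0, 3], [0, 3]], density=True)

# Per-bin area from the returned edges (0.5 everywhere in this case).
bin_area = np.outer(np.diff(xedges), np.diff(yedges))

total_density = density.sum()                   # plain sum of the values
total_probability = (density * bin_area).sum()  # "integral" over all bins

print(total_density, total_probability)
```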

General Case

In general the bins can have different sizes, so bin_area becomes an array with one entry per bin rather than a single number. Consider some unevenly spaced bins:

bin_count, binsx, binsy = np.histogram2d( [0,1,3], [0,1,3], 
    bins=([0.,1.5,3.],[0, .6, 1.7,3.]), 
    range=[[0,3],[0,3]], normed=False)

In this case, calculate the area of each bin explicitly:

bin_area = np.array( [ [(x1 -x0)* (y1-y0) 
    for y1,y0 in zip(binsy[1:], binsy[:-1])] 
        for x1,x0 in zip(binsx[1:], binsx[:-1]) ] )

print(bin_area)
#array([[ 0.9 ,  1.65,  1.95],
#       [ 0.9 ,  1.65,  1.95]])

bin_count / bin_count.sum() / bin_area
#array([[ 0.37037037,  0.2020202 ,  0.        ],
#       [ 0.        ,  0.        ,  0.17094017]])

Indeed, setting the normed arg to True gives the same result:

normed_bin_count, binsx, binsy = np.histogram2d( [0,1,3], [0,1,3], 
    bins=([0.,1.5,3.],[0, .6, 1.7,3.]), 
    range=[[0,3],[0,3]], normed=True)
print(normed_bin_count)
#array([[ 0.37037037,  0.2020202 ,  0.        ],
#       [ 0.        ,  0.        ,  0.17094017]])
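If what you want is the fraction of samples in each bin (so the array sums to 1 regardless of bin size), one option is to skip the normalization flag and divide the raw counts by their total. A minimal sketch using the question's second example:

```python
import numpy as np

x = [0, 1, 3]
y = [0, 1, 3]

# Raw counts only; no normed/density argument.
counts, xedges, yedges = np.histogram2d(
    x, y, bins=[3, 6], range=[[0, 3], [0, 3]])

# Fraction of samples per bin: no division by bin area.
probabilities = counts / counts.sum()

print(probabilities)
print(probabilities.sum())
```

Each occupied bin then holds 1/3, which matches the array the question expected.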

Upvotes: 1
