Al0000

Reputation: 73

Matplotlib histogram misplaced and missing bars

I have large data files, so I use NumPy's histogram function (the same one Matplotlib uses) to build histograms manually and update them batch by batch. However, when I plot the result, the bars appear shifted.

This is the code I use to manually create and update histograms in batches. Note that all histograms share the same bins.

temp = np.histogram(batch, bins=np.linspace(0, 40, 41))
hist += temp[0]

The code above is repeated as I parse the data files. For example, a small data set would have the following as the final histogram data:

[8190, 666, 278, 145, 113, 83, 52, 48, 45, 44, 45, 29, 28, 45, 29, 15, 16, 10, 17, 7, 15, 6, 10, 7, 3, 5, 7, 4, 2, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 29]
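For reference, the batch-wise accumulation described above can be sketched roughly as follows. The random batches are a hypothetical stand-in for the real data files, and the zero initialisation of hist is an assumption, since it is not shown in the snippet above.

```python
import numpy as np

# Shared bin edges: 41 edges define 40 bins on [0, 40].
edges = np.linspace(0, 40, 41)

# Hypothetical stand-in for the real data files: three random batches.
rng = np.random.default_rng(0)
batches = [rng.uniform(0, 40, size=1000) for _ in range(3)]

# Running total of per-bin counts (assumed to start at zero).
hist = np.zeros(len(edges) - 1, dtype=np.int64)
for batch in batches:
    counts, _ = np.histogram(batch, bins=edges)
    hist += counts

# Sanity check: accumulating per batch should match one histogram over all data.
full, _ = np.histogram(np.concatenate(batches), bins=edges)
print(np.array_equal(hist, full))
```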

Below is the plotting code.

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import numpy as np
plt.xticks(np.linspace(0, 1, 11))
plt.hist([i/40 for i in range(40)], bins=np.linspace(0, 1, 41), weights=scores, rwidth=0.7)
plt.yscale('log', nonpositive='clip')  # the keyword was 'nonposy' before Matplotlib 3.3

The resulting figure is quite strange. It shows no bar at [0.475, 0.5), and I expect the last bin, [0.975, 1.0], to hold the final count of 29. Instead, that bar appears at the [0.950, 0.975) position. I thought this might be related to using bins with linspace, but the dummy x-value array and the weights have the same length.


I've never seen this kind of behavior. I also thought it might come from the bins being half-open ranges [x, x+width), but I haven't had issues with that before.

A note on using linspace: it specifies edges, so 40 bins are specified by 41 edges.

In [2]: np.linspace(0,1,41)                                                     
Out[2]: 
array([0.   , 0.025, 0.05 , 0.075, 0.1  , 0.125, 0.15 , 0.175, 0.2  ,
       0.225, 0.25 , 0.275, 0.3  , 0.325, 0.35 , 0.375, 0.4  , 0.425,
       0.45 , 0.475, 0.5  , 0.525, 0.55 , 0.575, 0.6  , 0.625, 0.65 ,
       0.675, 0.7  , 0.725, 0.75 , 0.775, 0.8  , 0.825, 0.85 , 0.875,
       0.9  , 0.925, 0.95 , 0.975, 1.   ])

In [3]: len(np.linspace(0,1,41))                                                
Out[3]: 41

Upvotes: 0

Views: 2926

Answers (2)

firework

Reputation: 11

The problem is due to rounding error in np.linspace(0, 1, 41).

bins = []
for abin in np.linspace(0, 1, 41):
    bins.append(abin)

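The code above gives

bins = [0.0, 0.025, 0.05, 0.07500000000000001, 0.1, 0.125, 0.15000000000000002, ...]

which causes the problem: some edges (e.g. 0.07500000000000001) are slightly larger than the matching x-value (3/40 = 0.075), so that value lands in the bin to the left.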

Rounding the edges with np.round(np.linspace(0, 1, 41), 4) fixes the problem.

Example:

plt.hist([i/40 for i in range(40)], bins=np.round(np.linspace(0, 1, 41), 4), rwidth=1, ec='k')
plt.plot([i/40 for i in range(40)], [0.5] * 40, 'ro')
plt.xticks(np.linspace(0, 1, 11))
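The same effect can be checked numerically without plotting. This is only a sketch using np.histogram directly; the variable names are illustrative and not from the question.

```python
import numpy as np

# The x-values used in the question: 0/40, 1/40, ..., 39/40.
x = np.array([i / 40 for i in range(40)])

raw_edges = np.linspace(0, 1, 41)
rounded_edges = np.round(raw_edges, 4)

# Number of raw edges that differ from the matching x-value.
print("raw edges differing from i/40:", int(np.sum(raw_edges[:-1] != x)))

# With rounded edges, every x-value should sit exactly on its own left edge.
rounded_counts, _ = np.histogram(x, bins=rounded_edges)
print("counts with rounded edges:", rounded_counts)
```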


Upvotes: 1

JohanC

Reputation: 80299

It seems you're using plt.hist to put exactly one value into each bin, in effect simulating a bar plot. Because the x-values fall exactly on the bin bounds, rounding can push them into the neighboring bin. That can be mitigated by shifting the x-values by half a bin width. The simplest solution is to draw the bars directly.

The following code creates a bar plot with the given data, with each bar at the center of the region it represents. As a check, each bar's height is read back at the end and displayed above the bar.

from matplotlib.ticker import MultipleLocator
import matplotlib.pyplot as plt
import numpy as np

scores = [8190, 666, 278, 145, 113, 83, 52, 48, 45, 44, 45, 29, 28, 45, 29, 15, 16, 10, 17, 7, 15, 6, 10, 7, 3, 5, 7, 4, 2, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 29]
binbounds = np.linspace(0, 1, 41)
rwidth = 0.7
width = binbounds[1] - binbounds[0]
bars = plt.bar(binbounds[:-1] + width / 2, height=scores, width=width * rwidth, align='center')
plt.gca().xaxis.set_major_locator(MultipleLocator(0.1))
plt.gca().xaxis.set_minor_locator(MultipleLocator(0.05))
plt.yscale('log', nonpositive='clip')  # the keyword was 'nonposy' before Matplotlib 3.3
for rect in bars:
    x, y = rect.get_xy()
    w = rect.get_width()
    h = rect.get_height()
    plt.text(x + w / 2, h, f'{h}\n', ha='center', va='center')
plt.show()

resulting plot

PS: To see what's happening with the original histogram, just do a test plot without the weights:

plt.hist([i/40 for i in range(40)], bins=np.linspace(0, 1, 41), rwidth=1, ec='k')
plt.plot([i/40 for i in range(40)], [0.5] * 40, 'ro')
plt.xticks(np.linspace(0, 1, 11))

A red dot marks each x-value. Some fall into the correct bin; others fall into the neighboring bin, which then holds 2 values.

histogram without weights
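To list the affected bins without reading them off a plot, something along these lines should work. It is a small sketch built on np.histogram; the exact bins reported depend on the floating-point values that linspace produces.

```python
import numpy as np

x = [i / 40 for i in range(40)]
edges = np.linspace(0, 1, 41)

# One x-value per bin is intended, so any count other than 1 marks a misplaced value.
counts, _ = np.histogram(x, bins=edges)

for i in np.flatnonzero(counts != 1):
    print(f"bin {i} [{edges[i]:.3f}, {edges[i + 1]:.3f}) holds {counts[i]} value(s)")
```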

To create a histogram with the x-values at the center of each bin:

plt.hist([i/40 + 1/80 for i in range(40)], bins=np.linspace(0, 1, 41), rwidth=1, ec='k')
plt.plot([i/40 + 1/80 for i in range(40)], [0.5] * 40, 'ro')
plt.xticks(np.linspace(0, 1, 11))
plt.yticks([0, 1])

x-values in center of bin
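For the weighted case from the question, the centered x-values can be checked the same way. The sketch below assumes the counts from the question as weights and verifies that every weight lands in its own bin; centers is an illustrative name.

```python
import numpy as np

scores = [8190, 666, 278, 145, 113, 83, 52, 48, 45, 44, 45, 29, 28, 45, 29,
          15, 16, 10, 17, 7, 15, 6, 10, 7, 3, 5, 7, 4, 2, 3, 0, 1, 0, 0, 0,
          0, 0, 0, 0, 29]
edges = np.linspace(0, 1, 41)

# Bin centers lie strictly inside each bin, far from any edge rounding.
centers = (edges[:-1] + edges[1:]) / 2

# Each weight should be assigned to exactly the bin it belongs to.
binned, _ = np.histogram(centers, bins=edges, weights=scores)
print(np.array_equal(binned, scores))
```

The same centers, edges and scores arrays can be passed to plt.hist as x, bins and weights.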

Upvotes: 2
