Reputation: 7805

Making pyplot.hist() first and last bins include outliers

The pyplot.hist() documentation states that when a range is set for a histogram, "lower and upper outliers are ignored".

Is it possible to make the first and last bins of a histogram include all outliers without changing the width of those bins?

For example, let's say I want to look at the range 0-3 with 3 bins: 0-1, 1-2, 2-3 (let's ignore cases of exact equality for simplicity). I would like the first bin to include all values from minus infinity to 1, and the last bin to include all values from 2 to infinity. However, if I explicitly set these bins to span that range, they will be very wide. I would like them to have the same width. The behavior I am looking for is like the behavior of hist() in Matlab.

Obviously I can numpy.clip() the data and plot that, which gives me what I want, but I would like to know whether there is a built-in solution for this.

Upvotes: 23

Answers (3)

Alperino

Reputation: 558

What pyplot.hist does can be split into two steps: computing the histogram counts, and then plotting them as a bar chart. Perhaps the easiest way to achieve what you want is to do the first step with numpy.histogram and the second with pyplot.bar directly:

import numpy as np
import matplotlib.pyplot as plt

# dummy data
rng = np.random.default_rng()
x = rng.normal(loc=1, scale=0.5, size=1000)
bins = np.linspace(0,2, 11)

fig = plt.figure(layout='tight')
ax = fig.add_subplot()
ax.set_xticks(bins)

# In order to include outliers, expand the bins with -Inf and +Inf before calculating the distribution.
bins_expanded = np.concatenate([[-np.inf],bins,[np.inf]])
dist, _ = np.histogram(x, bins_expanded)

# Then, use the original bins to plot the bar chart.
# Here, the outliers are plotted as separate bars stacked on top of the outermost bins.
width = bins[1]-bins[0]
ax.bar(bins[0], dist[0], bottom=dist[1], align='edge', width=width, color='red', edgecolor='black', label='left outliers')
ax.bar(bins[:-1],dist[1:-1], align='edge', width=width, color='blue', edgecolor='black', label='distribution')
ax.bar(bins[-2],dist[-1], bottom=dist[-2], align='edge', width=width, color='green', edgecolor='black', label='right outliers')

ax.legend(loc='upper left')
plt.show()
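
If the outliers do not need to be shown separately, the same two-step idea can fold their counts into the outermost bins instead of stacking them, which is closer to what the question describes. The following is only a sketch of that variant; the helper names `folded_counts` and `plot_folded` are made up for illustration and are not part of NumPy or Matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

def folded_counts(data, bins):
    """Per-bin counts where the first and last bins also absorb the outliers."""
    bins = np.asarray(bins, dtype=float)
    # Add an open-ended bin on each side to catch everything outside `bins`.
    expanded = np.concatenate([[-np.inf], bins, [np.inf]])
    dist, _ = np.histogram(data, expanded)
    counts = dist[1:-1].copy()
    counts[0] += dist[0]    # values below bins[0]
    counts[-1] += dist[-1]  # values at or above bins[-1]
    return counts

def plot_folded(ax, data, bins):
    """Draw the folded counts with the original, equal-width bins."""
    bins = np.asarray(bins, dtype=float)
    counts = folded_counts(data, bins)
    ax.bar(bins[:-1], counts, width=np.diff(bins), align='edge', edgecolor='black')
    return counts

# Small hand-checkable example: -5 and 40 lie outside the range 0-3.
sample = [-5.0, 0.5, 1.5, 1.7, 2.5, 40.0]
counts = folded_counts(sample, [0, 1, 2, 3])
```

To draw it, create an axes with `fig, ax = plt.subplots()`, call `plot_folded(ax, sample, [0, 1, 2, 3])`, and finish with `plt.show()`. Every data point ends up in exactly one bar, so the counts sum to the number of samples.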

Upvotes: 0

pelson

Reputation: 21839

No. Looking at matplotlib.axes.Axes.hist and the direct use of numpy.histogram I'm fairly confident in saying that there is no smarter solution than using clip (other than extending the bins that you histogram with).

I'd encourage you to look at the source of matplotlib.axes.Axes.hist (it's just Python code, though admittedly hist is slightly more complex than most of the Axes methods) - it is the best way to verify this kind of question.
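
For reference, the clip approach mentioned above might look roughly like the sketch below. The wrapper name `clipped_hist` and the sample values are assumptions made for illustration; only `numpy.clip` and `Axes.hist` are real library calls.

```python
import numpy as np
import matplotlib.pyplot as plt

def clipped_hist(ax, data, lower, upper, bins=3):
    """Histogram over [lower, upper] whose edge bins absorb the outliers."""
    data = np.asarray(data, dtype=float)
    # Values below `lower` become `lower` and values above `upper` become
    # `upper`, so they are counted in the first and last bins respectively.
    clipped = np.clip(data, lower, upper)
    counts, edges, _ = ax.hist(clipped, bins=bins, range=(lower, upper))
    return counts, edges
```

With `fig, ax = plt.subplots()`, a call such as `clipped_hist(ax, [-5.0, 0.5, 1.5, 1.7, 2.5, 40.0], lower=0, upper=3)` keeps the three bins at equal width while counting -5 in the first bin and 40 in the last. The plot itself gives no hint that the edge bins contain clipped values, so it may be worth saying so in a label or caption.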

Upvotes: 10

Benjamin Doughty

Reputation: 475

I was also struggling with this, and didn't want to use .clip() because it could be misleading, so I wrote a little function (borrowing heavily from this) to indicate that the upper and lower bins contained outliers:

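import numpy as np
import matplotlib.pyplot as plt

def outlier_aware_hist(data, lower=None, upper=None):
    # data is expected to be a NumPy array.
    # Compare against None explicitly so that a bound of 0 is not treated as unset.
    if lower is None or lower < data.min():
        lower = data.min()
        lower_outliers = False
    else:
        lower_outliers = True

    if upper is None or upper > data.max():
        upper = data.max()
        upper_outliers = False
    else:
        upper_outliers = True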

    n, bins, patches = plt.hist(data, range=(lower, upper), bins='auto')

    if lower_outliers:
        n_lower_outliers = (data < lower).sum()
        patches[0].set_height(patches[0].get_height() + n_lower_outliers)
        patches[0].set_facecolor('c')
        patches[0].set_label('Lower outliers: ({:.2f}, {:.2f})'.format(data.min(), lower))

    if upper_outliers:
        n_upper_outliers = (data > upper).sum()
        patches[-1].set_height(patches[-1].get_height() + n_upper_outliers)
        patches[-1].set_facecolor('m')
        patches[-1].set_label('Upper outliers: ({:.2f}, {:.2f})'.format(upper, data.max()))

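    if lower_outliers or upper_outliers:
        plt.legend()
        # set_height() does not update the axes data limits, so recompute them
        # in case an outlier bar is now the tallest one.
        ax = plt.gca()
        ax.relim()
        ax.autoscale_view()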

You can also combine it with an automatic outlier detector (borrowed from here) like so:

def mad(data):
    median = np.median(data)
    diff = np.abs(data - median)
    mad = np.median(diff)
    return mad

def calculate_bounds(data, z_thresh=3.5):
    MAD = mad(data)
    median = np.median(data)
    const = z_thresh * MAD / 0.6745
    return (median - const, median + const)

outlier_aware_hist(data, *calculate_bounds(data))
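
To see what these bounds look like in practice, here is a small hand-checkable sketch. It repeats the same arithmetic in a single function so that it runs on its own; the name `mad_bounds` and the sample values are made up for illustration. The constant 0.6745 is approximately the 0.75 quantile of the standard normal distribution, which puts the MAD on the same scale as a standard deviation for normally distributed data.

```python
import numpy as np

def mad_bounds(data, z_thresh=3.5):
    """Lower and upper bounds at z_thresh modified z-scores from the median."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    # Median absolute deviation from the median.
    mad = np.median(np.abs(data - median))
    const = z_thresh * mad / 0.6745
    return median - const, median + const

sample = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
# median = 3, absolute deviations = [2, 1, 0, 1, 97], so MAD = 1
lower, upper = mad_bounds(sample)
outliers = sample[(sample < lower) | (sample > upper)]
```

Here the bounds come out at roughly (-2.19, 8.19), so only the value 100 falls outside them. Note that if more than half of the values are identical, the MAD is 0 and both bounds collapse onto the median.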

Upvotes: 16
