Matplotlib: multiple boxplot with (multiple) broken axis

Question

First, here is what I need: a boxplot with a broken x-axis, possibly with more than one break. An example is this figure

I have two lists, X and Y, of the same length (X holds floats, Y holds ints). First I group Y into sublists according to the integer part of X:

number_of_units = int(max(X)) + 1
my_data = []
for i in range(number_of_units):
  my_data.append([])

for i in range(len(X)):
  j = int(X[i] )
  my_data[j].append(Y[i])

In this way my_data is a list of lists with number_of_units sublists. The k-th sublist contains all the Y values associated with X values whose integer part is k. Here is the problem: X spans many orders of magnitude and a typical value of number_of_units is 10^5, but most of the X values have an integer part in [1,10], so most of the sublists in my_data are empty. The direct consequence is that if I do
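One way to avoid building the empty sublists at all is to group by key in a dictionary instead of a dense list. This is only a minimal sketch, not part of the original code: the helper name `group_by_integer_part` and the sample values are made up for illustration.

```python
from collections import defaultdict

def group_by_integer_part(X, Y):
    """Group Y values by int(X); only integer parts that occur get a key."""
    groups = defaultdict(list)
    for x, y in zip(X, Y):
        groups[int(x)].append(y)
    # Sort by key so the groups come out in increasing order of position.
    return dict(sorted(groups.items()))

# Hypothetical sample data: the X values are sparse on the integer axis.
X = [1.2, 1.7, 2.5, 1000.3]
Y = [4, 6, 5, 9]

grouped = group_by_integer_part(X, Y)
positions = list(grouped)          # integer parts that actually occur
my_data = list(grouped.values())   # one non-empty sublist per position
```

The two lists could then be passed as `ax.boxplot(my_data, positions=positions)`. That alone does not remove the large gaps on the axis, but it avoids allocating on the order of 10^5 empty sublists.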

fig, ax = plt.subplots()
ax.boxplot(my_data, 'options')

I obtain something like the following figure (note the "upper-right" red point):

This is due to the emptiness of most of the sublists in my_data: most of the plot shows "zero frequency". So what I need is to break the x-axis of the plot wherever the frequency is zero. Note that:

The points where the axis has to be broken must be found dynamically, since they change with the data.
It is very likely that the axis has to be broken multiple times.

Theoretical idea

Split the list my_data into M lists of lists according to where my_data is empty: if my_data[k] is the first empty sublist, then my_data[0],...,my_data[k-1] form the first group; the second group begins at the first non-empty sublist with index >k and ends at the next empty sublist, and so on.
Call ax.boxplot() for each of the new lists of lists. This time none of the sublists will be empty.
Plot each ax as a subplot and join all the subplots as suggested here.
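The first step of this idea can be sketched as follows. This is an illustrative sketch under the assumption that my_data is a plain list of lists; the function name `split_nonempty_runs` and the toy input are hypothetical.

```python
def split_nonempty_runs(my_data):
    """Return (start_index, sublists) for each maximal run of non-empty sublists."""
    runs = []
    current = []
    start = None
    for k, sub in enumerate(my_data):
        if len(sub) > 0:
            if not current:
                start = k          # a new run begins at index k
            current.append(sub)
        elif current:
            runs.append((start, current))   # an empty sublist closes the run
            current = []
    if current:
        runs.append((start, current))       # close a run that reaches the end
    return runs

# Toy input: two runs of non-empty sublists separated by empty ones.
my_data = [[3, 4], [5], [], [], [7], []]
runs = split_nonempty_runs(my_data)
number_of_subplots = len(runs)
```

Since `len(runs)` is computed from the data before any figure is created, the number of subplots does not have to be known in advance.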

This approach presents a number of difficulties for me. The main one is that I don't know a priori how many subplots I will need, since that number depends on the dataset, and I really don't know how to overcome this. So I ask:

How can I automatically locate the regions of the x-axis that have non-zero frequency and plot only those regions, with a broken axis wherever a region ends?

Any suggestion would be appreciated.

EDIT

My question is not a duplicate of this question, because the latter does not contain any explanation of how to break the x-axis. However, the combination of the information in questions 1 and 2 might fully solve the problem. I'm working on it and will edit the question further once the problem is solved.

GRquanti · Accepted Answer

Consider a data sample built like this:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from itertools import groupby
from operator import itemgetter

def truncated_power_law(a, m):
    # discrete distribution on 1..m with P(x) proportional to x**(-a)
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))

a, m = 2, 100000
d = truncated_power_law(a=a, m=m)
N = 10**2

X = np.sort(np.asarray(list(set(d.rvs(size=N)))))
# one sample of 100 random values per distinct X value
Y = []
for i in range(0,len(X)):
    Y.append(i*np.random.rand(100))

The details of the data do not matter, except that X is power-law distributed. This implies that a lot of values between min(X) and max(X) don't appear in the sample.

Now, if you simply do

m_props = {'color': 'red',}
b_props = {'color': 'black', 'linestyle': '-'}
w_props = {'color': 'black', 'linestyle': '-'}
c_props = {'color': 'black', 'linestyle': '-'}

f_ugly, ax_ugly = plt.subplots()
ax_ugly.boxplot(Y, notch=0, sym='', positions=X, medianprops=m_props,
                boxprops=b_props, whiskerprops=w_props, capprops=c_props)

You obtain something like this:

Now consider this:

# X is divided into sublists of consecutive values
# (within a run of consecutive integers, index - value is constant)
dominiums = []
for k, g in groupby(enumerate(X), lambda pair: pair[0] - pair[1]):
    dominiums.append(list(map(itemgetter(1), g)))

number_of_subplots = len(dominiums)

k = 0
d = .01
l = .015

f, axes = plt.subplots(
    nrows=1, ncols=number_of_subplots, sharex=False, sharey=True,
    gridspec_kw={'width_ratios': [3*len(dominiums[h]) for h in
                                  range(number_of_subplots)],
                 'wspace': 0.05})

axes[0].yaxis.tick_left()
axes[0].spines['right'].set_visible(False)

kwargs = dict(transform=axes[0].transAxes, color='k', linewidth=1,
              clip_on=False)
axes[0].plot((1-d/1.5,1+d/1.5), (-d,+d), **kwargs)
axes[0].plot((1-d/1.5,1+d/1.5),(1-d,1+d), **kwargs)
kwargs.update(transform = axes[-1].transAxes)
axes[-1].plot((-l,+l), (1-d,1+d), **kwargs)
axes[-1].plot((-l,+l), (-d,+d), **kwargs)

for i in range(number_of_subplots):
    # take the next len(dominiums[i]) samples of Y for this subplot
    n_boxes = len(dominiums[i])
    data_in_this_subplot = Y[k:k + n_boxes]
    k = k + n_boxes

    axes[i].boxplot(data_in_this_subplot, notch=0, sym='',
                    positions=dominiums[i], medianprops=m_props,
                    boxprops=b_props, whiskerprops=w_props,
                    capprops=c_props)

    if i != 0:
        # every subplot but the first: hide the left spine and the y ticks
        axes[i].spines['left'].set_visible(False)
        axes[i].tick_params(axis='y', which='both', labelright=False,
                            length=0)
    if i != number_of_subplots - 1:
        # every subplot but the last: hide the right spine and draw the
        # diagonal break marks in axes coordinates
        axes[i].spines['right'].set_visible(False)
        kwargs = dict(transform=axes[i].transAxes, color='k',
                      linewidth=1, clip_on=False)
        axes[i].plot((1-l, 1+l), (-d, +d), **kwargs)
        axes[i].plot((1-l, 1+l), (1-d, 1+d), **kwargs)
        axes[i].plot((-l, +l), (1-d, 1+d), **kwargs)
        axes[i].plot((-l, +l), (-d, +d), **kwargs)

Using the same data as in the first figure, the code above produces the following:

IMHO this code fully answers the question: it automatically locates the relevant regions of the x-axis and plots only those regions, with a broken axis wherever a region ends.
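To see how the regions are located, note that within a run of consecutive integers the difference between index and value is constant, so it can serve as a grouping key. Below is a small worked sketch in Python 3 syntax; the function name `consecutive_runs` and the input `X_small` are made up for illustration.

```python
from itertools import groupby

def consecutive_runs(values):
    """Split a sorted sequence of integers into runs of consecutive values."""
    runs = []
    # Within a run of consecutive integers, index minus value is constant.
    for _, group in groupby(enumerate(values),
                            key=lambda pair: pair[0] - pair[1]):
        runs.append([value for _, value in group])
    return runs

X_small = [1, 2, 3, 7, 8, 20]
runs = consecutive_runs(X_small)                  # [[1, 2, 3], [7, 8], [20]]
width_ratios = [3 * len(run) for run in runs]     # [9, 6, 3]
```

Each run becomes one subplot, and the width ratios make every subplot proportional to the number of boxes it holds.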

Weakness of the solution: it has a number of arbitrary parameters that must be tuned for every data set (e.g. d, l, and the factor 3 in 3*len(dominiums[h])).

Strength of the solution: you don't need to know a priori the number of relevant regions (i.e. the number of subplots).

Thanks to wwii for his useful answer and comments.
