GRquanti

Reputation: 554

Matplotlib: multiple boxplot with (multiple) broken axis

First of all, here is what I need: a boxplot with a broken x-axis, possibly with more than a single break. An example is this figure: [figure: example of a boxplot with a broken x-axis]

Now: I have two lists, X and Y, of the same length (X contains floats, Y contains ints). First I group Y into sublists according to the integer part of X:

number_of_units = int(max(X)) + 1
my_data = []
for i in range(number_of_units):
  my_data.append([])

for i in range(len(X)):
  j = int(X[i] )
  my_data[j].append(Y[i])

In this way my_data is a list of lists with number_of_units sublists. The k-th sublist contains all the Y values associated with X values whose integer part is k. Here is the problem: most of the sublists are empty. X spans many orders of magnitude and a typical value of number_of_units is 10^5, but most of the X values have an integer part in [1,10], so most of the sublists in my_data are empty. The direct consequence is that if I do

fig, ax = plt.subplots()
ax.boxplot(my_data, 'options')

I obtain something like the following figure (note the "upper-right" red point):

[figure: resulting boxplot]

This is due to the emptiness of most of the sublists in my_data: most of the plot shows "zero frequency". So what I need is to break the x-axis of the plot wherever the frequency is zero. Note that:

Theoretical idea

  1. Split the list my_data into M lists of lists, where the split is made according to the emptiness of my_data: if my_data[k] is the first empty sublist, then my_data[0],...,my_data[k-1] is the first group. The second group begins at the first non-empty sublist with index >k and ends at the next empty sublist, and so on.

  2. Call ax.boxplot() for each of the new lists of lists. This time none of the sublists will be empty.

  3. Plot each ax as subplots and join all the subplots as suggested here.
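Step 1 above can be sketched in a few lines. This is only a minimal illustration, assuming my_data is built as in the question; the helper name split_nonempty_runs and the toy data are made up for the example. The length of the result is the number of subplots, so it does not need to be known in advance.

```python
from itertools import groupby

def split_nonempty_runs(my_data):
    """Return (start_index, sublists) for each run of consecutive non-empty sublists.

    Hypothetical helper, not part of matplotlib.
    """
    runs = []
    # groupby collects neighbouring entries that share the same "is non-empty" flag
    for is_nonempty, group in groupby(enumerate(my_data), key=lambda pair: bool(pair[1])):
        if is_nonempty:
            group = list(group)
            start_index = group[0][0]
            runs.append((start_index, [sublist for _, sublist in group]))
    return runs

# Toy data: two non-empty regions separated by empty sublists
my_data = [[], [5, 7], [3], [], [], [8, 1, 2], []]
runs = split_nonempty_runs(my_data)
print(runs)       # [(1, [[5, 7], [3]]), (5, [[8, 1, 2]])]
print(len(runs))  # 2, the number of subplots needed
```

Each start_index gives the x position of the first box in its group, so range(start_index, start_index + len(sublists)) could serve as the positions argument of ax.boxplot().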

This approach presents a number of difficulties for me. The main one is that I don't know a priori how many subplots I will need, since this number depends on the dataset, and I really don't know how to overcome this. So I ask:

How can I automatically locate the regions of the X-axis that have non-zero frequency and plot only those regions, with a break in the underlying axis every time a region ends?

Any suggestion would be appreciated.

EDIT

My question is not a duplicate of this question, because the latter does not contain any explanation of how to break the X axis. However, combining the information in questions 1 and 2 might fully solve the problem. I'm working on it and I will edit the question further once the problem is solved.

Upvotes: 0

Views: 1524

Answers (2)

GRquanti

Reputation: 554

Consider a data sample built like this:

import numpy as np
from pylab import *
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from itertools import *
from operator import itemgetter
import scipy.stats as stats

def truncated_power_law(a, m):
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))

a, m = 2, 100000
d = truncated_power_law(a=a, m=m)
N = 10**2

X = np.sort(np.asarray(list(set(d.rvs(size=N)))))
Y = []
for i in range(0,len(X)):
    Y.append(i*np.random.rand(100))

The details of the data don't matter, except that X is power-law distributed. This implies that a lot of values between min(X) and max(X) don't appear in the sample.

Now, if you simply do

m_props = {'color': 'red',}
b_props = {'color': 'black', 'linestyle': '-'}
w_props = {'color': 'black', 'linestyle': '-'}
c_props = {'color': 'black', 'linestyle': '-'}

f_ugly, ax_ugly = plt.subplots()
ax_ugly.boxplot(Y, notch = 0, sym = '', positions = X, medianprops = 
        m_props, boxprops = b_props, whiskerprops = w_props, capprops 
        = c_props)

you obtain something like this: [figure: bad_box]

Now consider this:

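# X is divided into sublists of consecutive values:
# index minus value is constant within a run of consecutive integers
dominiums = []
for k, g in groupby(enumerate(X), lambda pair: pair[0] - pair[1]):
    # list(...) is needed in Python 3, where map returns an iterator
    dominiums.append(list(map(itemgetter(1), g)))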

number_of_subplots = len(dominiums)

k = 0
d = .01
l = .015

f, axes = plt.subplots(nrows = 1, ncols = number_of_subplots, sharex = 
              False, sharey = True, gridspec_kw = {'width_ratios':
              [3*len(dominiums[h]) for h in 
              range(number_of_subplots)],'wspace':0.05})

axes[0].yaxis.tick_left()
axes[0].spines['right'].set_visible(False)

kwargs = dict(transform = axes[0].transAxes, color='k', linewidth = 1, 
         clip_on = False)
axes[0].plot((1-d/1.5,1+d/1.5), (-d,+d), **kwargs)
axes[0].plot((1-d/1.5,1+d/1.5),(1-d,1+d), **kwargs)
kwargs.update(transform = axes[-1].transAxes)
axes[-1].plot((-l,+l), (1-d,1+d), **kwargs)
axes[-1].plot((-l,+l), (-d,+d), **kwargs)

for i in range(number_of_subplots):
    data_in_this_subplot = []
    for j in range(len(dominiums[i])):
        data_in_this_subplot.append([])
        data_in_this_subplot[j] = Y[k]
        k = k + 1

    axes[i].boxplot(data_in_this_subplot, notch = 0, sym = '', 
            positions = dominiums[i], medianprops = m_props, boxprops 
            = b_props, whiskerprops = w_props, capprops = c_props)

    if i != 0:
        axes[i].spines['left'].set_visible(False)
        axes[i].tick_params(axis = 'y', which = 'both', labelright = 
                False, length = 0)
    if i != number_of_subplots -1:
        axes[i].spines['right'].set_visible(False)
        kwargs = dict(transform = axes[i].transAxes, color='k', 
                 linewidth = 1, clip_on=False)
        axes[i].plot((1-l,1+l), (-d,+d), **kwargs)
        axes[i].plot((1-l,1+l),(1-d,1+d), **kwargs)
        kwargs.update(transform = axes[i].transAxes)
        axes[i].plot((-l,+l), (1-d,1+d), **kwargs)
        axes[i].plot((-l,+l), (-d,+d), **kwargs)

Using the same data as in the first figure, the latter code produces the following: [figure: good box]

IMHO this code fully answers the question: it automatically locates the relevant regions of the X axis and plots only those regions, with a break in the underlying axis every time a region ends.

Weakness of the solution: it has a number of arbitrary parameters that must be tuned for every different data set (e.g. d, l, and the number 3 in 3*len(dominiums[h])).

Strength of the solution: you don't need to know a priori the number of relevant regions (i.e. the number of subplots).

Thanks to wwii for his useful answer and comments.

Upvotes: 1

wwii

Reputation: 23753

Without further evidence (your question lacks a minimal example of X and Y), it looks like the X and Y values are related to each other by their positions/indices, and you are trying to preserve that relationship by placing Y values in my_data at the index of the related X value. I imagine you are doing that so you don't have to pass the X values to .boxplot(), but that creates a lot of empty space that you don't want in your visualization.

If your data looks similar to this fake data:

X = [1,2,3,9,10,11,50,51,52]
Y = [590, 673, 49, 399, 551, 19, 618, 358, 106, 84,
     537, 865, 507, 862, 905, 335, 195, 250, 54, 497,
     224, 612, 4, 16, 423, 52, 222, 421, 562, 140, 324,
     599, 295, 836, 887, 222, 790, 860, 917, 100, 348,
     141, 221, 575, 48, 411, 0, 245, 635, 631, 349, 646]

The relationship between X, Y, and my_data can be seen by adding a print statement to the for loop that constructs my_data:

....
    my_data[j].append(Y[i])
    print(f'X[{i}]:{X[i]:<6}Y[{i}]:{Y[i]:<6}my_data[{j}]:{my_data[j]}')

>>>
X[0]:1     Y[0]:590   my_data[1]:[590]
X[1]:2     Y[1]:673   my_data[2]:[673]
X[2]:3     Y[2]:49    my_data[3]:[49]
X[3]:9     Y[3]:399   my_data[9]:[399]
X[4]:10    Y[4]:551   my_data[10]:[551]
X[5]:11    Y[5]:19    my_data[11]:[19]
X[6]:50    Y[6]:618   my_data[50]:[618]
X[7]:51    Y[7]:358   my_data[51]:[358]
X[8]:52    Y[8]:106   my_data[52]:[106]

>>>

You would probably be better off not creating the empty space in the first place and just passing the x's and y's to .boxplot(), using X as the argument for its positions parameter:

# again fake Y data
y_s = [[thing] for thing in Y[:len(X)]]
plt.boxplot(y_s, positions=X)
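If, as in the question, X holds floats and several Y values share the same integer part of X, the grouping can be done with a dictionary so that no empty sublists are ever created. This is a rough sketch under that assumption; the function name group_by_integer_part and the sample values are made up for illustration.

```python
from collections import defaultdict

def group_by_integer_part(X, Y):
    """Group Y values by int(x); return the sorted positions and the matching boxes.

    Hypothetical helper, assuming X and Y have the same length.
    """
    groups = defaultdict(list)
    for x, y in zip(X, Y):
        groups[int(x)].append(y)
    positions = sorted(groups)               # only integer parts that actually occur
    data = [groups[p] for p in positions]    # one list of Y values per position
    return positions, data

# Made-up sample: floats in X, ints in Y
X = [1.2, 1.7, 2.5, 9.1, 9.9, 50.3]
Y = [590, 673, 49, 399, 551, 19]
positions, data = group_by_integer_part(X, Y)
print(positions)  # [1, 2, 9, 50]
print(data)       # [[590, 673], [49], [399, 551], [19]]
```

The two results can then be passed on as plt.boxplot(data, positions=positions).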

This still leaves a lot of empty space in the plot. This can be fixed by segregating X and Y to slices of contiguous X values then creating subplots of the fragments using a loop (see Dynamically add/create subplots in matplotlib)
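A minimal sketch of that idea follows, assuming X is sorted and holds unique integers, and that boxes[i] contains the values for X[i]. The function names are illustrative only, and the diagonal break marks between subplots are left out.

```python
from itertools import groupby

def contiguous_fragments(X):
    """Split sorted, unique integers into lists of consecutive values."""
    fragments = []
    # index minus value is constant within a run of consecutive integers
    for _, run in groupby(enumerate(X), key=lambda pair: pair[0] - pair[1]):
        fragments.append([x for _, x in run])
    return fragments

def boxplot_fragments(X, boxes):
    """Draw one subplot per fragment (hypothetical helper); boxes[i] belongs to X[i]."""
    import matplotlib.pyplot as plt  # imported here so the splitting helper works without it
    fragments = contiguous_fragments(X)
    fig, axes = plt.subplots(
        1, len(fragments), sharey=True, squeeze=False,
        gridspec_kw={'width_ratios': [len(f) for f in fragments]})
    start = 0
    for ax, fragment in zip(axes[0], fragments):
        # slice out the boxes that belong to this fragment
        ax.boxplot(boxes[start:start + len(fragment)], positions=fragment)
        start += len(fragment)
    return fig, axes[0]

X = [1, 2, 3, 9, 10, 11, 50, 51, 52]
print(contiguous_fragments(X))  # [[1, 2, 3], [9, 10, 11], [50, 51, 52]]
```

The number of subplots comes from len(fragments), so it never has to be known beforehand. With the fake data above, boxplot_fragments(X, [[y] for y in Y[:len(X)]]) would draw three panels.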

Upvotes: 0
