Reputation: 554
First of all, here is what I need: a boxplot with a broken x-axis, possibly with more than one break. An example is this figure
Now: I have two lists, X and Y (X = float, Y = int). First I group Y into sublists according to the integer part of X (X and Y are the same length):
number_of_units = int(max(X)) + 1
my_data = []
for i in range(number_of_units):
    my_data.append([])
for i in range(len(X)):
    j = int(X[i])
    my_data[j].append(Y[i])
In this way my_data is a list of lists with number_of_units sublists. The k-th sublist contains all the Y values associated with X values whose integer part is k. Here is the problem: X spans many orders of magnitude and a typical value of number_of_units is 10^5, but most of the X values have integer part in [1,10], so most of the sublists in my_data are empty. The direct consequence is that if I do
fig, ax = plt.subplots()
ax.boxplot(my_data, 'options')
I obtain something like the following figure (note the "upper-right" red point):
This is due to the emptiness of most of the sublists in my_data: most of the plot shows "zero-frequency". So what I need is to break the x-axis of the plot wherever the frequency is zero. Note that:
Theoretical idea
Split the list my_data into M lists of lists, where the split is done according to the emptiness of my_data: if my_data[k] is the first empty sublist, then my_data[0],...,my_data[k-1] is the first group; then find the first non-empty sublist with index >k, and there the second group begins. When I find another empty sublist, the second group is complete, and so on. I hope I was clear.
Do an ax.boxplot() for each of the new lists of lists. This time none of the sublists will be empty.
Plot each ax as a subplot and join all the subplots as suggested here.
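The splitting step above can be sketched as follows. This is only an illustration, not part of the original question: the helper name split_nonempty_runs and the sample list are hypothetical. It also records the starting index of each group, which can serve as the position of the first box in each subplot, and the number of groups gives the number of subplots.

```python
def split_nonempty_runs(my_data):
    """Split a list of lists into runs of consecutive non-empty sublists.

    Returns a list of (start_index, run) pairs, where run holds the
    non-empty sublists my_data[start_index], my_data[start_index + 1], ...
    (hypothetical helper, written for illustration only).
    """
    runs = []
    current = None  # the run being built, or None while inside a gap
    for k, sublist in enumerate(my_data):
        if sublist:
            if current is None:
                # a new group begins at the first non-empty sublist after a gap
                current = (k, [])
                runs.append(current)
            current[1].append(sublist)
        else:
            # an empty sublist closes the current group
            current = None
    return runs


example = [[], [5, 7], [3], [], [], [8], []]
groups = split_nonempty_runs(example)
print(groups)       # [(1, [[5, 7], [3]]), (5, [[8]])]
print(len(groups))  # 2, i.e. the number of subplots that would be needed
```

Because the number of groups is computed from the data, it does not have to be known in advance.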
This approach poses a number of difficulties for me. The main problem is that I don't know a priori how many subplots I will need, since that number depends on the dataset, and I really don't know how to overcome this. So I ask:
How can I automatically locate the regions of the x-axis that have non-zero frequency and plot only those regions, with a broken axis wherever a region ends?
Any suggestion would be appreciated.
EDIT
My question is not a duplicate of this question, because the latter does not contain any explanation of how to break the x-axis. However, combining the information in questions 1 and 2 might fully solve the problem. I'm working on it and will edit the question further once the problem is solved.
Upvotes: 0
Views: 1524
Reputation: 554
Consider a data sample built like this:
import numpy as np
import matplotlib.pyplot as plt
from itertools import groupby
from operator import itemgetter
import scipy.stats as stats

def truncated_power_law(a, m):
    # Discrete distribution on 1..m with P(x) proportional to 1/x**a
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))

a, m = 2, 100000
d = truncated_power_law(a=a, m=m)
N = 10**2
# Sorted unique sample values: these are the boxplot positions
X = np.sort(np.asarray(list(set(d.rvs(size=N)))))
# One array of 100 fake observations per position
Y = []
for i in range(0, len(X)):
    Y.append(i*np.random.rand(100))
Don't worry about the data, except that X is power-law distributed. This implies that many values between min(X) and max(X) don't appear in the sample.
Now, if you simply do
m_props = {'color': 'red',}
b_props = {'color': 'black', 'linestyle': '-'}
w_props = {'color': 'black', 'linestyle': '-'}
c_props = {'color': 'black', 'linestyle': '-'}
f_ugly, ax_ugly = plt.subplots()
ax_ugly.boxplot(Y, notch=0, sym='', positions=X, medianprops=m_props,
                boxprops=b_props, whiskerprops=w_props, capprops=c_props)
You obtain something like this:
Now consider this:
# X is divided into sublists of consecutive values: within a run of
# consecutive integers, index - value is constant
dominiums = []
for _, g in groupby(enumerate(X), lambda pair: pair[0] - pair[1]):
    dominiums.append(list(map(itemgetter(1), g)))
number_of_subplots = len(dominiums)
k = 0     # running index into Y
d = .01   # half-height of the break marks, in axes coordinates
l = .015  # half-width of the break marks, in axes coordinates
# Assumes X has at least two separate runs, so that axes is an array
f, axes = plt.subplots(
    nrows=1, ncols=number_of_subplots, sharex=False, sharey=True,
    gridspec_kw={'width_ratios': [3*len(dominiums[h])
                                  for h in range(number_of_subplots)],
                 'wspace': 0.05})
# First subplot: keep the left y-axis and draw break marks on its right edge
axes[0].yaxis.tick_left()
axes[0].spines['right'].set_visible(False)
kwargs = dict(transform=axes[0].transAxes, color='k', linewidth=1,
              clip_on=False)
axes[0].plot((1-d/1.5, 1+d/1.5), (-d, +d), **kwargs)
axes[0].plot((1-d/1.5, 1+d/1.5), (1-d, 1+d), **kwargs)
# Last subplot: draw break marks on its left edge
kwargs.update(transform=axes[-1].transAxes)
axes[-1].plot((-l, +l), (1-d, 1+d), **kwargs)
axes[-1].plot((-l, +l), (-d, +d), **kwargs)
for i in range(number_of_subplots):
    # Collect the Y samples whose positions fall in this run
    data_in_this_subplot = []
    for j in range(len(dominiums[i])):
        data_in_this_subplot.append(Y[k])
        k = k + 1
    axes[i].boxplot(data_in_this_subplot, notch=0, sym='',
                    positions=dominiums[i], medianprops=m_props,
                    boxprops=b_props, whiskerprops=w_props,
                    capprops=c_props)
    if i != 0:
        axes[i].spines['left'].set_visible(False)
        axes[i].tick_params(axis='y', which='both', labelright=False,
                            length=0)
        if i != number_of_subplots - 1:
            # Middle subplots get break marks on both edges
            axes[i].spines['right'].set_visible(False)
            kwargs = dict(transform=axes[i].transAxes, color='k',
                          linewidth=1, clip_on=False)
            axes[i].plot((1-l, 1+l), (-d, +d), **kwargs)
            axes[i].plot((1-l, 1+l), (1-d, 1+d), **kwargs)
            axes[i].plot((-l, +l), (1-d, 1+d), **kwargs)
            axes[i].plot((-l, +l), (-d, +d), **kwargs)
Using the same data as in the first figure, the latter code produces the following:
IMHO this code fully answers the question: it automatically locates the relevant regions of the x-axis and plots only those regions, with a broken axis wherever a region ends.
Weakness of the solution: it has a number of arbitrary parameters that must be tuned for every data set (e.g. d, l, and the number 3 in 3*len(dominiums[h])).
Strength of the solution: you don't need to know a priori the number of relevant regions (i.e. the number of subplots).
Thanks to wwii for his useful answer and comments.
Upvotes: 1
Reputation: 23753
Without further evidence (your question lacks a minimal example of X and Y), it looks like X and Y values are related to each other by their positions/indices and you are trying to preserve that relationship by placing Y values in my_data at the index of the related X value. I imagine you are doing that so you don't have to pass the X values to .boxplot(), but that creates a lot of empty space that you don't want in your visualization.
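One way to keep the X/Y relationship without allocating an empty sublist for every unused integer is to group into a dictionary keyed by the integer part of X. The sketch below is only an illustration: the helper name group_by_integer_part and the sample values are made up, and it assumes X and Y have the same length.

```python
from collections import defaultdict


def group_by_integer_part(X, Y):
    """Group Y values by int(x) of the paired X value.

    Returns (positions, data): the sorted integer parts that actually
    occur, and the matching lists of Y values. No empty lists are created.
    (Hypothetical helper, written for illustration only.)
    """
    groups = defaultdict(list)
    for x, y in zip(X, Y):
        groups[int(x)].append(y)
    positions = sorted(groups)
    data = [groups[p] for p in positions]
    return positions, data


sample_x = [1.2, 1.7, 2.5, 9.1, 50.3, 50.9]  # made-up floats
sample_y = [4, 8, 15, 16, 23, 42]            # made-up ints
positions, data = group_by_integer_part(sample_x, sample_y)
print(positions)  # [1, 2, 9, 50]
print(data)       # [[4, 8], [15], [16], [23, 42]]
```

The two results can then be passed on as boxplot(data, positions=positions).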
If your data looks similar to this fake data:
X = [1,2,3,9,10,11,50,51,52]
Y = [590, 673, 49, 399, 551, 19, 618, 358, 106, 84,
537, 865, 507, 862, 905, 335, 195, 250, 54, 497,
224, 612, 4, 16, 423, 52, 222, 421, 562, 140, 324,
599, 295, 836, 887, 222, 790, 860, 917, 100, 348,
141, 221, 575, 48, 411, 0, 245, 635, 631, 349, 646]
The relationship between X, Y, and my_data can be seen by adding a print statement to the for loop that constructs my_data:
....
    my_data[j].append(Y[i])
    print(f'X[{i}]:{X[i]:<6}Y[{i}]:{Y[i]:<6}my_data[{j}]:{my_data[j]}')
>>>
X[0]:1     Y[0]:590   my_data[1]:[590]
X[1]:2     Y[1]:673   my_data[2]:[673]
X[2]:3     Y[2]:49    my_data[3]:[49]
X[3]:9     Y[3]:399   my_data[9]:[399]
X[4]:10    Y[4]:551   my_data[10]:[551]
X[5]:11    Y[5]:19    my_data[11]:[19]
X[6]:50    Y[6]:618   my_data[50]:[618]
X[7]:51    Y[7]:358   my_data[51]:[358]
X[8]:52    Y[8]:106   my_data[52]:[106]
>>>
You would probably be better off not creating the empty space in the first place and just passing the x's and y's to .boxplot, using X as the argument for boxplot's positions parameter:
# again fake Y data
y_s = [[thing] for thing in Y[:len(X)]]
plt.boxplot(y_s, positions=X)
This still leaves a lot of empty space in the plot. This can be fixed by segregating X and Y into slices of contiguous X values, then creating subplots of the fragments using a loop (see Dynamically add/create subplots in matplotlib).
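A minimal sketch of the segregation step, assuming X is sorted and holds unique integers and that each entry of the Y-like list is the sample for the X at the same index. The helper name contiguous_runs and the demo variables are hypothetical. It relies on the fact that, within a run of consecutive integers, value minus index is constant.

```python
from itertools import groupby


def contiguous_runs(X, Y):
    """Yield (xs, ys) pairs, one per run of consecutive integers in X.

    Assumes X is sorted with unique integer values and len(X) == len(Y).
    (Hypothetical helper, written for illustration only.)
    """
    indexed = enumerate(zip(X, Y))
    # item is (index, (x, y)); x - index stays constant inside a run
    for _, run in groupby(indexed, key=lambda item: item[1][0] - item[0]):
        pairs = [pair for _, pair in run]
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        yield xs, ys


xs_demo = [1, 2, 3, 9, 10, 11, 50, 51, 52]
ys_demo = [[590], [673], [49], [399], [551], [19], [618], [358], [106]]
fragments = list(contiguous_runs(xs_demo, ys_demo))
print([xs for xs, _ in fragments])  # [[1, 2, 3], [9, 10, 11], [50, 51, 52]]
```

Each (xs, ys) fragment can then go to its own subplot inside the loop, for example with boxplot(ys, positions=xs), and len(fragments) gives the number of subplots to create.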
Upvotes: 0