Patratacus

Reputation: 1791

How to connect boxplot median values

It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.

I'm using this syntax to create the boxplot, so that it automatically generates the box plot for Y vs. X device without any external manipulation of the data frame:

df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)


One way I thought of is to draw a line plot using the mean values from the boxplot, but I'm not sure how to extract that information from the plot.

Upvotes: 8

Views: 8502

Answers (2)

Vero Mieites

Reputation: 1

You can get the values of the medians by calling the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.

Let bp be your boxplot, created as bp=plt.boxplot(data). Then bp is a dict containing the medians key, among others. That key holds a list of matplotlib.lines.Line2D objects, one per box, from which you can extract the (x,y) position as follows:

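import numpy as np
import matplotlib.pyplot as plt

# data: any sequence of arrays, one per box
bp = plt.boxplot(data)
X = []
Y = []
for m in bp['medians']:
    # each median is a horizontal Line2D segment: ([x0, x1], [y0, y1])
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))  # x-centre of the box
    Y.append(np.mean((y0, y1)))  # median value (y0 == y1)
plt.plot(X, Y, c='C1')  # connect the medians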

For an arbitrary dataset (data), this script draws the boxplot with a line connecting the medians. Hope it helps!

Upvotes: 0

andrew_reece

Reputation: 21274

You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.

First let's generate some sample data:

import pandas as pd
import numpy as np
import seaborn as sns

N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})

print(df.head())
  group     value
0     A  0.816847
1     A  0.468465
2     C  0.871975
3     B  0.933708
4     A  0.480170
              ...

Next, make the boxplot and save the axis object:

ax = df.boxplot(column='value', by='group', showfliers=True, 
                positions=range(df.group.unique().shape[0]))

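Note: There's a curious positions argument in Matplotlib's/Pandas' boxplot(). By default the boxes are drawn at x = 1..N, while Seaborn places categories at x = 0..N-1, which causes off-by-one errors when overlaying the two. See more in this discussion, including the workaround I've employed here.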
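To see the offset for yourself, here's a small sketch using plain Matplotlib. The sample data and the box_centres helper are made up for illustration; the helper just reads each box's x-centre back from its median line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = [rng.random(50) for _ in range(3)]  # three made-up groups

def box_centres(bp):
    """Return the x-centre of each box, read from its median line (hypothetical helper)."""
    return [float(np.mean(m.get_xdata())) for m in bp["medians"]]

fig, (ax_default, ax_shifted) = plt.subplots(1, 2)

# Default: boxes are placed at 1, 2, ..., N
default_centres = box_centres(ax_default.boxplot(data))

# With explicit positions: boxes are placed at 0, 1, ..., N-1
shifted_centres = box_centres(
    ax_shifted.boxplot(data, positions=list(range(len(data))))
)

print(default_centres)
print(shifted_centres)
```

The second set of centres starts at 0, which is where a categorical Seaborn plot on the same axis would put the first category.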

Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:

sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)


Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
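If you'd rather avoid the Seaborn dependency, something along these lines should also work. It's only a sketch under a few assumptions: the sample data and column names are the made-up ones from above, the boxes are left at their default positions (1..N), and it relies on both df.boxplot(by=...) and groupby ordering the groups the same (sorted) way.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up sample data, same shape as in the example above
rng = np.random.default_rng(0)
N = 150
df = pd.DataFrame({
    "value": rng.random(N),
    "group": rng.choice(["A", "B", "C"], size=N),
})

# Boxplot with default box positions 1..N
ax = df.boxplot(column="value", by="group", showfliers=True)

# Per-group medians, in sorted group order (assumed to match the box order)
medians = df.groupby("group")["value"].median()

# Overlay a line through the medians at the default box positions
positions = np.arange(1, len(medians) + 1)
(line,) = ax.plot(positions, medians.to_numpy(), marker="o", color="C1")
```

Swap median() for mean() to connect the means instead.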

Upvotes: 5
