Reputation: 1791
It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax for the boxplot so that it automatically generates the box plot of Y vs. X, without my having to manipulate the data frame first:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One approach I considered is to draw a line plot using the mean values from the boxplot, but I'm not sure how to extract that information from the plot.
Upvotes: 8
Views: 8502
Reputation: 1
You can get the median values with the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot, created as bp = plt.boxplot(data). Then bp is a dict containing the key medians, among others. That key holds a list of matplotlib.lines.Line2D objects, from which you can extract the (x, y) position as follows:
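import numpy as np
import matplotlib.pyplot as plt

bp = plt.boxplot(data)  # data: your dataset
X = []
Y = []
for m in bp['medians']:
    # each median is drawn as a short horizontal segment; take its midpoint
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))
    Y.append(np.mean((y0, y1)))
plt.plot(X, Y, c='C1')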
For an arbitrary dataset (data), this script draws the boxplot with a line connecting the medians. Hope it helps!
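Since the question asks about means rather than medians, the same idea can probably be applied to the mean markers. The sketch below assumes the boxplot is drawn with showmeans=True, in which case the returned dict also has a means key; the sample data and variable names are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample data: three groups with different centers
rng = np.random.default_rng(0)
data = [rng.normal(loc=m, size=50) for m in (0.0, 1.0, 2.0)]

fig, ax = plt.subplots()
bp = ax.boxplot(data, showmeans=True)

# Each entry of bp['means'] is a Line2D; averaging its coordinates gives
# the mean position whether it is drawn as a marker or as a line.
X = [float(np.mean(m.get_xdata())) for m in bp['means']]
Y = [float(np.mean(m.get_ydata())) for m in bp['means']]

# Connect the means with a line on the same axes
ax.plot(X, Y, c='C1')
```

The x values come back as the box positions (1, 2, 3 by default), so the line lands on the boxes without any extra offset.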
Upvotes: 0
Reputation: 21274
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
                positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
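If you'd rather not depend on Seaborn, a rough matplotlib-only sketch of the same overlay is below. It assumes the default box positions 1..n and that the boxes are ordered the same way as the sorted groupby keys, so no positions workaround is used; the sample data and the names rng and mean_line are just for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical sample data, similar in shape to the example above
rng = np.random.default_rng(0)
N = 150
df = pd.DataFrame({
    'value': rng.random(N),
    'group': rng.choice(['A', 'B', 'C'], size=N),
})

# Draw the grouped boxplot on an explicit axes so it can be reused
fig, ax = plt.subplots()
df.boxplot(column='value', by='group', showmeans=True, ax=ax)

# Per-group means, sorted by group label
means = df.groupby('group')['value'].mean()

# Assumption: boxes sit at x = 1, 2, ..., n (the default positions)
positions = np.arange(1, len(means) + 1)
mean_line, = ax.plot(positions, means.to_numpy(), marker='o', c='C1')
```

Swap mean() for median() to connect the medians instead.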
Upvotes: 5