Reputation: 185
I am using Seaborn to make boxplots from pandas dataframes. Seaborn
boxplots seem to essentially read the dataframes the same way as the pandas
boxplot
functionality (so I hope the solution is the same for both -- but I can just use the dataframe.boxplot
function as well). My dataframe has 12 columns and the following code generates a single plot with one boxplot for each column (just like the dataframe.boxplot()
function would).
fig, ax = plt.subplots()
sns.set_style("darkgrid", {"axes.facecolor":"darkgrey"})
pal = sns.color_palette("husl",12)
sns.boxplot(dataframe, color = pal)
Can anyone suggest a simple way of overlaying all the values (by columns) while making a boxplot from dataframes? I will appreciate any help with this.
Upvotes: 4
Views: 8676
Reputation: 325
I have the following trick:
data = np.random.randn(6,5)
df = pd.DataFrame(data,columns = list('ABCDE'))
Now assign a dummy column to df:
df['Group'] = 'A'
print df
A B C D E Group
0 0.590600 0.226287 1.552091 -1.722084 0.459262 A
1 0.369391 -0.037151 0.136172 -0.772484 1.143328 A
2 1.147314 -0.883715 -0.444182 -1.294227 1.503786 A
3 -0.721351 0.358747 0.323395 0.165267 -1.412939 A
4 -1.757362 -0.271141 0.881554 1.229962 2.526487 A
5 -0.006882 1.503691 0.587047 0.142334 0.516781 A
Use the df.groupby.boxplot()
, you get it done.
df.groupby('Group').boxplot()
Upvotes: -1
Reputation: 49022
This hasn't been added to the seaborn.boxplot
function yet, but there's something similar in the seaborn.violinplot
function, which has other advantages:
x = np.random.randn(30, 6)
sns.violinplot(x, inner="points")
sns.despine(trim=True)
Upvotes: 6
Reputation: 54380
A general solution for the boxplot for the entire dataframe, which should work for both seaborn
and pandas
as their are all matplotlib
based under the hood, I will use pandas
plot as the example, assuming import matplotlib.pyplot as plt
already in place. As you have already have the ax
, it would make better sense to just use ax.text(...)
instead of plt.text(...)
.
In [35]:
print df
V1 V2 V3 V4 V5
0 0.895739 0.850580 0.307908 0.917853 0.047017
1 0.931968 0.284934 0.335696 0.153758 0.898149
2 0.405657 0.472525 0.958116 0.859716 0.067340
3 0.843003 0.224331 0.301219 0.000170 0.229840
4 0.634489 0.905062 0.857495 0.246697 0.983037
5 0.573692 0.951600 0.023633 0.292816 0.243963
[6 rows x 5 columns]
In [34]:
df.boxplot()
for x, y, s in zip(np.repeat(np.arange(df.shape[1])+1, df.shape[0]),
df.values.ravel(), df.values.astype('|S5').ravel()):
plt.text(x,y,s,ha='center',va='center')
For a single series in the dataframe, a few small changes is necessary:
In [35]:
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
for x, y, s in zip(np.repeat(1, df.shape[0]),
sub_df.ravel(), sub_df.values.astype('|S5').ravel()):
plt.text(x,y,s,ha='center',va='center')
Making scatter plots is also similar:
#for the whole thing
df.boxplot()
plt.scatter(np.repeat(np.arange(df.shape[1])+1, df.shape[0]), df.values.ravel(), marker='+', alpha=0.5)
#for just one column
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
plt.scatter(np.repeat(1, df.shape[0]), sub_df.ravel(), marker='+', alpha=0.5)
To overlay stuff on boxplot
, we need to first guess where each boxes are plotted at among xaxis
. They appears to be at 1,2,3,4,....
. Therefore, for the values in the first column, we want them to be plot at x=1; the 2nd column at x=2 and so on.
Any efficient way of doing it is to use np.repeat
, repeat 1,2,3,4...
, each for n
times, where n
is the number of observations. Then we can make a plot, using those numbers as x
coordinates. Since it is one-dimensional, for the y
coordinates, we will need a flatten view of the data, provided by df.ravel()
For overlaying the text strings, we need a anther step (a loop). As we can only plot one x value, one y value and one text string at a time.
Upvotes: 2