python pandas matplotlib seaborn plot-annotations

Reputation: 1285

How does one insert statistical annotations (stars or p-values)

This seems like a trivial question, but I've been searching for a while and can't seem to find an answer. It also seems like something that should be a standard part of these packages. Does anyone know if there is a standard way to include statistical annotation between distribution plots in seaborn?

For example, between two box plots or swarm plots?

Upvotes: 56

Answers (3)

Elide

Reputation: 61

You could use the package starbars. You give it the pairs and their p-values, and it draws the bars for you:

import seaborn as sns
import matplotlib.pyplot as plt
import starbars

# same tips dataset used in the other answers
tips = sns.load_dataset("tips")
sns.boxplot(x="day", y="total_bill", data=tips)

# adding statistical annotation
annotations = [("Sat", "Sun", 0.002), ("Fri", "Thur", 0.05)]
starbars.draw_annotation(annotations)

plt.show()

It also has an option to hide the bars for non-significant p-values:

starbars.draw_annotation(annotations, ns_show=False)

You can find the starbars documentation here.

Disclaimer: I'm the author of the package.

Upvotes: 4

Ulrich Stern

Reputation: 11301

A brace / bracket can be plotted directly with matplotlib.pyplot.plot or matplotlib.axes.Axes.plot, and annotations can be added with matplotlib.pyplot.text or matplotlib.axes.Axes.text.

seaborn categorical plots are 0-indexed, whereas box plots drawn with matplotlib and pandas are placed at range(1, N+1) by default; this can be adjusted with the positions parameter.

seaborn is a high-level API for matplotlib, and pandas.DataFrame.plot uses matplotlib as the default backend.

Imports and DataFrame

import seaborn as sns
import matplotlib.pyplot as plt

# dataframe in long form for seaborn
tips = sns.load_dataset("tips")

# dataframe in wide form for plotting with pandas.DataFrame.plot
df = tips.pivot(columns='day', values='total_bill')

# data as a list of lists for plotting directly with matplotlib (no nan values allowed)
data = [df[c].dropna().tolist() for c in df.columns]

`seaborn`

sns.boxplot(x="day", y="total_bill", data=tips, palette="PRGn")

# statistical annotation
x1, x2 = 2, 3   # columns 'Sat' and 'Sun' (first column: 0, see plt.xticks())
y, h, col = tips['total_bill'].max() + 2, 2, 'k'

plt.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
plt.text((x1+x2)*.5, y+h, "ns", ha='center', va='bottom', color=col)

plt.show()

`pandas.DataFrame.plot`

ax = df.plot(kind='box', positions=range(len(df.columns)))

x1, x2 = 2, 3
y, h, col = df.max().max() + 2, 2, 'k'

ax.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
ax.text((x1+x2)*.5, y+h, "ns", ha='center', va='bottom', color=col)

`matplotlib`

plt.boxplot(data, positions=range(len(data)))

x1, x2 = 2, 3

y, h, col = max(map(max, data)) + 2, 2, 'k'

plt.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
plt.text((x1+x2)*.5, y+h, "ns", ha='center', va='bottom', color=col)

tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

df.head()

day  Thur  Fri  Sat    Sun
0     NaN  NaN  NaN  16.99
1     NaN  NaN  NaN  10.34
2     NaN  NaN  NaN  21.01
3     NaN  NaN  NaN  23.68
4     NaN  NaN  NaN  24.59

data

[[27.2, 22.76, 17.29, ..., 20.53, 16.47, 18.78],
 [28.97, 22.49, 5.75, ..., 13.42, 16.27, 10.09],
 [20.65, 17.92, 20.29, ..., 29.03, 27.18, 22.67, 17.82],
 [16.99, 10.34, 21.01, ..., 18.15, 23.1, 15.69]]
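The examples above hard-code the label "ns". As a rough sketch of how the label could be derived from a p-value instead, the helper below maps p to a star string and wraps the same plot / text calls in a function. The names p_to_stars and annotate_pair are hypothetical, and the thresholds are one common convention assumed here, not something stated in this answer. The p-value itself has to come from whatever test you run.

```python
def p_to_stars(p):
    """Map a p-value to a star label (assumed convention, adjust as needed)."""
    # Assumed thresholds: ****: p <= 1e-4, ***: p <= 1e-3, **: p <= 1e-2, *: p <= 0.05
    for threshold, label in ((1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")):
        if p <= threshold:
            return label
    return "ns"


def annotate_pair(ax, x1, x2, y, h, p, color="k"):
    """Draw one bracket from x1 to x2 at height y and label it from p."""
    # Same bracket shape as in the examples above: up, across, down.
    ax.plot([x1, x1, x2, x2], [y, y + h, y + h, y], lw=1.5, c=color)
    # Centre the label just above the horizontal part of the bracket.
    return ax.text((x1 + x2) * 0.5, y + h, p_to_stars(p),
                   ha="center", va="bottom", color=color)
```

With the seaborn example, this could be called as annotate_pair(plt.gca(), 2, 3, tips['total_bill'].max() + 2, 2, p) before plt.show(), where p is your own test result.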

Upvotes: 81

fokkerplanck

Reputation: 1044

One may also be interested in adding several annotations to different pairs of boxes. In such a case, it is useful to handle the placement of the different lines and texts along the y-axis automatically. Other contributors and I wrote a small function to handle these cases (see GitHub repo), which stacks the lines on top of one another without overlapping. Annotations can be placed either inside or outside the plot, and several statistical tests are implemented, including Mann-Whitney and t-test (independent and paired). Here is a minimal example.

import matplotlib.pyplot as plt
import seaborn as sns
from statannot import add_stat_annotation

sns.set(style="whitegrid")
df = sns.load_dataset("tips")

x = "day"
y = "total_bill"
order = ['Sun', 'Thur', 'Fri', 'Sat']
ax = sns.boxplot(data=df, x=x, y=y, order=order)
add_stat_annotation(ax, data=df, x=x, y=y, order=order,
                    box_pairs=[("Thur", "Fri"), ("Thur", "Sat"), ("Fri", "Sun")],
                    test='Mann-Whitney', text_format='star', loc='outside', verbose=2)

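# second example: boxes split by hue, so each box is a (day, smoker) tuple
plt.figure()  # new figure, so this plot is not drawn on top of the first one
x = "day"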
y = "total_bill"
hue = "smoker"
ax = sns.boxplot(data=df, x=x, y=y, hue=hue)
add_stat_annotation(ax, data=df, x=x, y=y, hue=hue,
                    box_pairs=[(("Thur", "No"), ("Fri", "No")),
                                 (("Sat", "Yes"), ("Sat", "No")),
                                 (("Sun", "No"), ("Thur", "Yes"))
                                ],
                    test='t-test_ind', text_format='full', loc='inside', verbose=2)
plt.legend(loc='upper left', bbox_to_anchor=(1.03, 1))
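To give an idea of what "stacking without overlapping" involves, here is a minimal greedy sketch in plain Python. It is not statannot's actual algorithm, and assign_levels and bracket_heights are hypothetical names. It assumes each pair is given as two x positions (0-indexed, as in seaborn) and that brackets sharing an endpoint should not sit on the same row.

```python
def assign_levels(pairs):
    """Give each (x1, x2) bracket the lowest row where it overlaps no other bracket."""
    # Place narrow brackets first so they end up closest to the boxes.
    order = sorted(range(len(pairs)), key=lambda i: abs(pairs[i][1] - pairs[i][0]))
    rows = []                    # rows[k] holds the (lo, hi) spans already on row k
    levels = [0] * len(pairs)    # result, in the same order as the input pairs
    for i in order:
        lo, hi = sorted(pairs[i])
        for k, spans in enumerate(rows):
            # Spans that share an endpoint count as overlapping.
            if all(hi < s_lo or lo > s_hi for s_lo, s_hi in spans):
                spans.append((lo, hi))
                levels[i] = k
                break
        else:
            # No existing row has room, so open a new one on top.
            rows.append([(lo, hi)])
            levels[i] = len(rows) - 1
    return levels


def bracket_heights(pairs, y0, step):
    """Turn row numbers into y coordinates: row 0 sits at y0, each row adds step."""
    return [y0 + step * level for level in assign_levels(pairs)]
```

Each returned height can then be used as the y of a bracket drawn with ax.plot and ax.text; step needs to be larger than the bracket height plus the text height for the labels not to collide.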

Upvotes: 73
