Mark McGown

Reputation: 1095

How to highlight certain boxplots?

I have boxplots of two brands for each state, and in some states the difference between the two brands is statistically significant.

a4_dims = (40, 10)
fig, ax = pyplot.subplots(figsize=a4_dims)
dd=pd.melt(df_box,id_vars=['region'],value_vars=['Lowe\'s','Home Depot'],var_name='brands')
a = df_box.groupby(['region']).sum()
# order the regions by their total across both brands, largest first
most_visits_order = a.assign(tmp=a.sum(axis=1)).sort_values('tmp', ascending=False).drop(columns='tmp').index.tolist()
sns.boxplot(x='region',y='value',data=dd,hue='brands',showfliers=False,order=most_visits_order,ax=ax)


How can I highlight or call attention to the states where I've found a statistically significant difference (say TX and GA, for example)?

I've tried converting it to a for loop so that I could add the plots manually for each x, but that didn't work out too well:

fig, ax = plt.subplots()
n=len(stat_sig)
fig,ax = plt.subplots(n,1, figsize=(6,n*2), sharex=True,squeeze=False)
for i in range(n):
    plt.sca(ax[i])
    dd=pd.melt(df_box[df_box['region']==stat_sig[i]],id_vars=['region'],value_vars=['Lowe\'s','Home Depot'],var_name='brands')
    ax = sns.boxplot(x='region',y='value',data=dd,hue='brands',width=0.2)
ax.legend_.remove()
plt.show()

Error: TypeError: unhashable type: 'numpy.ndarray'

Upvotes: 0

Views: 686

Answers (1)

StupidWolf

Reputation: 46898

It is most likely easier to start from the melted dataframe, for example:

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(27)
dd = pd.DataFrame({'region':np.random.choice(['A','B','C','D','E'],50),
                  'value':np.random.uniform(0,1,50),
                  'brands':np.random.choice(['1','2'],50)})

# plotting order: regions sorted by their summed value, smallest first
o = dd.groupby('region')['value'].sum().sort_values().index

Define a function that performs a t-test (or any other test) and returns the p-value:

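def dotest(df):
    # one list of values per brand; the unpacking assumes exactly two brands per region
    x,y = df.groupby('brands')['value'].apply(list)
    # ttest_ind returns (statistic, pvalue); keep only the p-value
    return stats.ttest_ind(x,y)[1]

# one p-value per region, reordered to match the plotting order o
pvalues = dd.groupby('region').apply(dotest)[o]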
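
If some regions might be missing a brand, an explicit loop may be easier to control than groupby().apply(). The following is only a sketch: region_pvalues and the demo dataframe are made up for illustration, and it assumes the same column names (region, brands, value) as above.

```python
import numpy as np
import pandas as pd
from scipy import stats

def region_pvalues(df, group_col="region", brand_col="brands", value_col="value"):
    """Return one two-sample t-test p-value per region (NaN if the test cannot be run)."""
    out = {}
    for region, sub in df.groupby(group_col):
        # one array of values per brand present in this region
        samples = [g[value_col].to_numpy() for _, g in sub.groupby(brand_col)]
        if len(samples) != 2 or min(len(s) for s in samples) < 2:
            out[region] = np.nan  # not exactly two brands, or too few observations
            continue
        out[region] = stats.ttest_ind(samples[0], samples[1]).pvalue
    return pd.Series(out, name="pvalue")

# hypothetical demo data: brand "2" is shifted upwards in region A only
rng = np.random.default_rng(27)
demo = pd.DataFrame({
    "region": np.repeat(["A", "B"], 40),
    "brands": np.tile(np.repeat(["1", "2"], 20), 2),
    "value": rng.normal(0, 1, 80),
})
demo.loc[(demo["region"] == "A") & (demo["brands"] == "2"), "value"] += 5

pv = region_pvalues(demo)
# comparisons with NaN are False, so untestable regions are never flagged
significant = pv.index[pv < 0.05].tolist()
```

Here significant holds the region names rather than positions, which is convenient if the plotting order changes later.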

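In the plot, the regions sit at integer x-positions 0, 1, 2, ... in the order passed as order=o, so with the default width the boxes of the first region lie between about -0.4 and 0.4, those of the second between 0.6 and 1.4, and so on. So you just need to figure out which of your regions are significant and shade those x-ranges: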
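
To check where the regions end up on the x-axis, you can read the tick positions back from the axes. This is a small sketch with made-up data (toy and order are hypothetical names); it uses the non-interactive Agg backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical data: three regions, two brands, ten values per combination
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "region": np.repeat(["A", "B", "C"], 20),
    "brands": np.tile(["1", "2"], 30),
    "value": rng.uniform(0, 1, 60),
})
order = ["B", "A", "C"]

fig, ax = plt.subplots()
sns.boxplot(x="region", y="value", hue="brands", data=toy, order=order, ax=ax)
fig.canvas.draw()  # make sure the tick labels are populated

# the categories sit at consecutive integers, in the order given by `order`
tick_positions = [float(t) for t in ax.get_xticks()]
tick_labels = [lab.get_text() for lab in ax.get_xticklabels()]
```

The i-th entry of order is therefore drawn around x = i, which is why shading from i-0.4 to i+0.4 covers both boxes of that region.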

fig,ax = plt.subplots(1,1)
sns.boxplot(x='region',y='value',data=dd,hue='brands',showfliers=False,order=o,ax=ax)
for i in np.where(pvalues<0.05)[0]:
    ax.axvspan(i-0.4,i+0.4, color='red', alpha=0.1)
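
If you already have the names of the significant states (TX and GA in the question), you can look up their positions in the plotting order instead of using np.where. A rough sketch, where highlight_regions and the toy data are hypothetical and not part of seaborn:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def highlight_regions(ax, order, significant, half_width=0.4, **span_kwargs):
    """Shade the x-range of every region listed in `significant`; return the shaded positions."""
    style = {"color": "red", "alpha": 0.1, "zorder": 0}
    style.update(span_kwargs)
    wanted = set(significant)
    # position i on the x-axis corresponds to the i-th entry of `order`
    positions = [i for i, name in enumerate(order) if name in wanted]
    for i in positions:
        ax.axvspan(i - half_width, i + half_width, **style)
    return positions

# hypothetical data standing in for the melted dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "region": np.repeat(["TX", "GA", "FL", "NY"], 20),
    "brands": np.tile(["Lowe's", "Home Depot"], 40),
    "value": rng.uniform(0, 1, 80),
})
order = ["NY", "TX", "FL", "GA"]

fig, ax = plt.subplots()
sns.boxplot(x="region", y="value", hue="brands", data=toy,
            order=order, showfliers=False, ax=ax)
shaded = highlight_regions(ax, order, significant=["TX", "GA"])
```

Extra keyword arguments are passed on to axvspan, so something like highlight_regions(ax, order, ["TX"], color="gold", alpha=0.3) changes the look of the shading.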


Upvotes: 1
