Overlaying boxplots on the relative bin of a histogram

Question

Take the 'tips' dataset as an example:

total_bill	tip	smoker	day	time	size
16.99	1.01	No	Sun	Dinner	2
10.34	1.66	No	Sun	Dinner	3
21.01	3.50	No	Sun	Dinner	3
23.68	3.31	No	Sun	Dinner	2
24.59	3.61	No	Sun	Dinner	4

I want to plot the distribution of the variable 'total_bill' and, for each of its bins, show the distribution of the 'tip' values that fall into that bin. In this example, the graph is meant to answer the question: "What is the distribution of tips left by customers as a function of the bill they paid?"

I have more or less obtained the graph I wanted, but there is one problem, which I explain at the end.
The procedure I adopted is this:

Divide 'total_bill' into bins.

import pandas as pd

# tips is the DataFrame shown above; split total_bill into 10 equal-width bins
tips['bins_total_bill'] = pd.cut(tips.total_bill, 10)
tips.head()

total_bill	tip	smoker	day	time	size	bins_total_bill
16.99	1.01	No	Sun	Dinner	2	(12.618, 17.392]
10.34	1.66	No	Sun	Dinner	3	(7.844, 12.618]
21.01	3.50	No	Sun	Dinner	3	(17.392, 22.166]
23.68	3.31	No	Sun	Dinner	2	(22.166, 26.94]
24.59	3.61	No	Sun	Dinner	4	(22.166, 26.94]

Create a pd.Series with:
Index: the pd.Interval bins of total_bill
Values: number of occurrences in each bin
```
s = tips['bins_total_bill'].value_counts(sort=False)
s
```

(3.022, 7.844]       7
(7.844, 12.618]     42
(12.618, 17.392]    68
(17.392, 22.166]    51
(22.166, 26.94]     31
(26.94, 31.714]     19
(31.714, 36.488]    12
(36.488, 41.262]     7
(41.262, 46.036]     3
(46.036, 50.81]      4
Name: bins_total_bill, dtype: int64
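
As a side note, the numeric information behind these interval labels is still accessible, which matters for the axis problem described further down. The following is only a minimal sketch on a small made-up sample (not the real tips data); it assumes `pd.cut` is called with `retbins=True` so that the bin edges come back as numbers alongside the categories.

```python
import pandas as pd

# Small made-up sample standing in for tips.total_bill (not the real data).
total_bill = pd.Series(
    [3.0, 8.5, 12.0, 15.5, 18.0, 21.0, 24.5, 30.0, 35.0, 41.0, 50.0]
)

# retbins=True additionally returns the numeric bin edges as a NumPy array.
bins, edges = pd.cut(total_bill, 5, retbins=True)

# Count the occurrences per bin, keeping the bins in their natural order.
counts = bins.value_counts(sort=False)

# Each index entry is a pd.Interval with numeric left, right and mid attributes.
mids = [interval.mid for interval in counts.index]

print(counts)
print(edges)
print(mids)
```

The `edges` array and the `mid` values are plain numbers, so they can be used to relate a value such as the mean of 'total_bill' back to the bins.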

Combine the barplot and the boxplot in one figure

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

fig, ax1 = plt.subplots(dpi=200)
ax2 = ax1.twinx()

sns.barplot(ax=ax1, x = s.index, y = s.values)
sns.boxplot(ax=ax2, x='bins_total_bill', y='tip', data=tips)
sns.stripplot(ax=ax2, x='bins_total_bill', y='tip', data=tips, size=5, color="yellow", edgecolor='red', linewidth=0.3)

#Title and axis labels
ax1.tick_params(axis='x', rotation=90)
ax1.set_ylabel('Number of bills')
ax2.set_ylabel('Tips [$]')
ax1.set_xlabel("Mid value of total_bill bins [$]")
ax1.set_title("Tips ~ Total_bill distribution")

#Reference line at average(tip) + extra ytick at that value + Legend
avg_tip = np.mean(tips.tip)
ax2.axhline(y=avg_tip, color='red', linestyle="--", label="avg tip")
# Append avg_tip as an additional tick (array + scalar would shift every tick instead)
ax2.set_yticks(list(ax2.get_yticks()) + [avg_tip])
ax2.legend(loc='best')

#Set x-axis labels to the midpoint of each bin
ax1.set_xticklabels([round(interval.mid, 2) for interval in s.index])

This graph has one problem, though: because the x-axis is categorical, I cannot, for example, add a vertical line at the mean value of 'total_bill'.

How can I fix this to get the correct result? I also wonder whether there is a more streamlined approach than the one I have adopted.
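
One possible workaround, sketched here under stated assumptions rather than offered as a definitive fix: seaborn draws the categories of a categorical axis at the integer positions 0, 1, 2, ..., so a numeric value can be translated into an x coordinate by interpolating between the bin edges. The helper name `value_to_category_position` is hypothetical, and the convention that bin i occupies the span from i - 0.5 to i + 0.5 is an assumption made for this illustration.

```python
from bisect import bisect_right


def value_to_category_position(value, edges):
    """Map a numeric value to an x coordinate on a categorical axis.

    Assumes bin i (spanning edges[i]..edges[i + 1]) is drawn centred at
    x = i, so it occupies the span [i - 0.5, i + 0.5] on the axis.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError("value lies outside the binned range")
    # Index of the bin containing the value; the last edge belongs to the last bin.
    i = min(bisect_right(edges, value) - 1, len(edges) - 2)
    # Relative position of the value inside its bin, between 0 and 1.
    fraction = (value - edges[i]) / (edges[i + 1] - edges[i])
    return i - 0.5 + fraction


# Hypothetical, equally spaced edges for illustration only.
example_edges = [0.0, 10.0, 20.0, 30.0]
print(value_to_category_position(15.0, example_edges))
```

With the real data, the edges could come from `pd.cut(tips.total_bill, 10, retbins=True)` (converted to a list), and the returned position for the mean of 'total_bill' could then be passed to `ax1.axvline(x=...)`. Note that the bars themselves are narrower than the full category span, so the line sits at the interpolated position within the slot, not on a true numeric scale.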
