Reputation: 155
Taking the dataset 'tip' as an example
total_bill | tip | smoker | day | time | size |
---|---|---|---|---|---|
16.99 | 1.01 | No | Sun | Dinner | 2 |
10.34 | 1.66 | No | Sun | Dinner | 3 |
21.01 | 3.50 | No | Sun | Dinner | 3 |
23.68 | 3.31 | No | Sun | Dinner | 2 |
24.59 | 3.61 | No | Sun | Dinner | 4 |
what I'm trying to do is represent the distribution of the variable 'total_bill' and relate each of its bins to the distribution of the variable 'tip' linked to it. In this example, this graph is meant to answer the question: "What is the distribution of tips left by customers as a function of the bill they paid?"
I have more or less achieved the graph I wanted to obtain (but there is a problem. At the end I explain what it is).
And the procedure I adopted is this:
Dividing 'total_bill' into bins.
tips['bins_total_bill'] = pd.cut(tips.total_bill, 10)
tips.head()
total_bill | tip | smoker | day | time | size | bins_total_bill |
---|---|---|---|---|---|---|
16.99 | 1.01 | No | Sun | Dinner | 2 | (12.618, 17.392] |
10.34 | 1.66 | No | Sun | Dinner | 3 | (7.844, 12.618] |
21.01 | 3.50 | No | Sun | Dinner | 3 | (17.392, 22.166] |
23.68 | 3.31 | No | Sun | Dinner | 2 | (22.166, 26.94] |
24.59 | 3.61 | No | Sun | Dinner | 4 | (22.166, 26.94] |
Creation of a pd.Series with:
Index: pd.interval of total_cost bins
Values: n° of occurrences
s = tips['bins_total_bill'].value_counts(sort=False)
s
(3.022, 7.844] 7 (7.844, 12.618] 42 (12.618, 17.392] 68 (17.392, 22.166] 51 (22.166, 26.94] 31 (26.94, 31.714] 19 (31.714, 36.488] 12 (36.488, 41.262] 7 (41.262, 46.036] 3 (46.036, 50.81] 4 Name: bins_total_bill, dtype: int64
Combine barplot and poxplot together
fig, ax1 = plt.subplots(dpi=200)
ax2 = ax1.twinx()
sns.barplot(ax=ax1, x = s.index, y = s.values)
sns.boxplot(ax=ax2, x='bins_total_bill', y='tip', data=tips)
sns.stripplot(ax=ax2, x='bins_total_bill', y='tip', data=tips, size=5, color="yellow", edgecolor='red', linewidth=0.3)
#Title and axis labels
ax1.tick_params(axis='x', rotation=90)
ax1.set_ylabel('Number of bills')
ax2.set_ylabel('Tips [$]')
ax1.set_xlabel("Mid value of total_bill bins [$]")
ax1.set_title("Tips ~ Total_bill distribution")
#Reference lines average(tip) + add yticks + Legend
avg_tip = np.mean(tips.tip)
ax2.axhline(y=avg_tip, color='red', linestyle="--", label="avg tip")
ax2.set_yticks(list(ax2.get_yticks() + avg_tip))
ax2.legend(loc='best')
#Set labels axis x
ax1.set_xticklabels(list(map(lambda s: round(s.mid,2), s.index)))
It has to be said that this graph has a problem! As the x-axis is categorical, I cannot, for example, add a vertical line at the mean value of 'total_bill'.
How can I fix this to get the correct result? I also wonder if there is a correct and more streamlined approach than the one I have adopted.
Upvotes: 3
Views: 428
Reputation: 155
I thought of this method, which is more compact than the previous one (it can probably be done better) and overcomes the problem of scaling on the x-axis.
I split 'total_bill' into bins and add the column to Df
tips['bins_total_bill'] = pd.cut(tips.total_bill, 10)
Group column 'tip' by previously created bins
obj_gby_tips = tips.groupby('bins_total_bill')['tip']
gby_tip = dict(list(obj_gby_tips))
Create dictionary with:
keys: midpoint of each bins interval
values: gby tips for each interval
mid_total_bill_bins = list(map(lambda bins: bins.mid, list(gby_tip.keys())))
gby_tips = gby_tip.values()
tip_gby_total_bill_bins = dict(zip(mid_total_bill_bins, gby_tips))
Create chart by passing to each rectangle of the boxplot the centroid of each respective bins
fig, ax1 = plt.subplots(dpi=200)
ax2 = ax1.twinx()
bp_values = list(tip_gby_total_bill_bins.values())
bp_pos = list(tip_gby_total_bill_bins.keys())
l1 = sns.histplot(tips.total_bill, bins=10, ax=ax1)
l2 = ax2.boxplot(bp_values, positions=bp_pos, manage_ticks=False, patch_artist=True, widths=2)
#Average tips as hline
avg_tip = np.mean(tips.tip)
ax2.axhline(y=avg_tip, color='red', linestyle="--", label="avg tip")
ax2.set_yticks(list(ax2.get_yticks() + avg_tip)) #add value of avg(tip) to y-axis
#Average total_bill as vline
avg_total_bill=np.mean(tips.total_bill)
ax1.axvline(x=avg_total_bill, color='orange', linestyle="--", label="avg tot_bill")
then the result.
Upvotes: 1