Reputation: 889
I have created a histogram in a Jupyter notebook to show the distribution of time on page in seconds for 100 web visits.
Code as follows:
from matplotlib.ticker import StrMethodFormatter

ax = df.hist(column='time_on_page', bins=25, grid=False, figsize=(12,8), color='#86bf91', zorder=2, rwidth=0.9)
ax = ax[0]
for x in ax:
    # Despine
    x.spines['right'].set_visible(False)
    x.spines['top'].set_visible(False)
    x.spines['left'].set_visible(False)
    # Switch off ticks (booleans; the old "on"/"off" strings are all truthy in current Matplotlib)
    x.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)
    # Draw horizontal axis lines
    vals = x.get_yticks()
    for tick in vals:
        x.axhline(y=tick, linestyle='dashed', alpha=0.4, color='#eeeeee', zorder=1)
    # Set title
    x.set_title("Time on Page Histogram", fontsize=20, weight='bold', size=12)
    # Set x-axis label
    x.set_xlabel("Time on Page Duration (Seconds)", labelpad=20, weight='bold', size=12)
    # Set y-axis label
    x.set_ylabel("Page Views", labelpad=20, weight='bold', size=12)
    # Format y-axis label
    x.yaxis.set_major_formatter(StrMethodFormatter('{x:,g}'))
This produces the following visualisation:
I'm generally happy with the appearance; however, I'd like the axis to be a little more descriptive, perhaps showing the range of each bin and the percentage of the total that each bin constitutes.
I have looked for this in the Matplotlib documentation but cannot seem to find anything that would allow me to achieve my end goal.
Any help greatly appreciated.
Upvotes: 1
Views: 1212
Reputation: 80329
When you set bins=25, 25 equally spaced bins are set between the lowest and highest values encountered. If you use these ranges to label the bins, the result can be confusing because the boundaries fall on arbitrary values. It seems better to round the bin boundaries, for example to multiples of 20. These values can then be used as tick marks on the x-axis, falling neatly between the bars.
The percentages can be added by looping through the bars (rectangular patches). Their height indicates the number of rows belonging to the bin, so dividing by the total number of rows and multiplying by 100 gives a percentage. The text is positioned horizontally at the bar's x plus half its width, and vertically at the bar's height.
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({'time_on_page': np.random.lognormal(4, 1.1, 100)})
max_x = df['time_on_page'].max()
bin_width = max(20, np.round(max_x / 25 / 20) * 20)  # round to multiple of 20, use max(20, ...) to avoid rounding to zero
bins = np.arange(0, max_x + bin_width, bin_width)
axes = df.hist(column='time_on_page', bins=bins, grid=False, figsize=(12, 8), color='#86bf91', rwidth=0.9)
ax = axes[0, 0]
total = len(df)
ax.set_xticks(bins)
for p in ax.patches:
    h = p.get_height()
    if h > 0:
        # The trailing newline lifts the centred text just above the bar
        ax.text(p.get_x() + p.get_width() / 2, h, f'{h / total * 100.0 :.0f} %\n', ha='center', va='center')
ax.grid(True, axis='y', ls=':', alpha=0.4)
ax.set_axisbelow(True)
for side in ['left', 'right', 'top']:
    ax.spines[side].set_visible(False)
ax.tick_params(axis="y", length=0)  # Switch off y ticks
ax.margins(x=0.02)  # tighter x margins
plt.show()
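If you would rather print the bin range under each bar, as the question asks, one option is to compute the histogram with np.histogram and build the tick labels yourself. This is only a minimal sketch of that idea, not part of the code above: the helper names bin_labels and plot_labelled_hist are made up for illustration, and it assumes the values are non-negative, as durations are.

```python
import numpy as np
from matplotlib import pyplot as plt


def bin_labels(values, bin_width=20):
    """Return (edges, counts, labels), one 'lo-hi' + percentage label per bin.

    Hypothetical helper; assumes all values are >= 0.
    """
    values = np.asarray(values, dtype=float)
    # Bin edges at multiples of bin_width, extended to cover the largest value
    edges = np.arange(0, values.max() + bin_width, bin_width)
    counts, edges = np.histogram(values, bins=edges)
    percents = counts / len(values) * 100
    labels = [f'{lo:g}-{hi:g}\n{pct:.0f}%'
              for lo, hi, pct in zip(edges[:-1], edges[1:], percents)]
    return edges, counts, labels


def plot_labelled_hist(values, bin_width=20):
    """Draw the bars and put one range/percentage label under each bar."""
    edges, counts, labels = bin_labels(values, bin_width)
    centres = (edges[:-1] + edges[1:]) / 2
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.bar(centres, counts, width=0.9 * bin_width, color='#86bf91')
    ax.set_xticks(centres)        # one tick in the middle of each bar
    ax.set_xticklabels(labels)    # e.g. '0-20' on one line, '50%' below it
    ax.set_xlabel('Time on Page Duration (Seconds)')
    ax.set_ylabel('Page Views')
    return ax
```

With the DataFrame from above you would call plot_labelled_hist(df['time_on_page'], bin_width=20) followed by plt.show(). With many bins the two-line labels can start to overlap, so a wider bin_width or a smaller tick label font may be needed.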
Upvotes: 2