Reputation: 98
I am looping through a dataframe and drawing a histogram with a boxplot on top for each numerical column in my data, to better understand all of the variables in the dataset. The code below works, but the histogram shows gaps between the bars, and I want no space between adjacent bins. Any advice is appreciated - thanks!
for i in numerical_cols:
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
    sns.distplot(raw[i], ax=ax_hist, kde=False)
    sns.boxplot(raw[i], ax=ax_box)
    ax_box.set(xlabel='')
    sns.despine(ax=ax_hist)
    sns.despine(ax=ax_box, left=True)
    pdf.savefig()
    plt.close()
pdf.close()
plt.cla()
print(" ")
print("Done Writing Frequency Visualizations!")
Upvotes: 0
Views: 1413
Reputation: 80329
Your data seems to be discrete, taking only integer values. A standard histogram can then be misleading, because it creates equal-width bins whose edges don't line up with the integers. Here that apparently leaves many bins empty, which shows up as gaps between the bars. (With more distinct values, e.g. 100, each bin would cover several integers, but because the edges fall at non-integer positions, some bins would cover more integers than others.)
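To see this effect without any plotting, here is a small sketch in plain Python. The helper name `integers_per_bin` and the bin counts 50 and 8 are my own choices for illustration, not something from the question. It assumes integer data from 0 to 20 and mimics the usual histogram convention that bins are half-open except for the last one, which also includes its right edge.

```python
from collections import Counter


def integers_per_bin(lo, hi, n_bins):
    """Count how many distinct integers in [lo, hi] land in each of
    n_bins equal-width bins spanning [lo, hi]."""
    counts = Counter()
    for value in range(lo, hi + 1):
        # Integer arithmetic for floor((value - lo) / bin_width);
        # the maximum value is clamped into the last bin.
        index = min((value - lo) * n_bins // (hi - lo), n_bins - 1)
        counts[index] += 1
    return [counts[b] for b in range(n_bins)]


many = integers_per_bin(0, 20, 50)  # more bins than distinct values
few = integers_per_bin(0, 20, 8)    # fewer bins than distinct values

print("empty bins out of 50:", many.count(0))
print("integers covered by each of 8 bins:", few)
```

With 50 bins, 29 of them contain no integer at all, which is what appears as gaps. With 8 bins, some bins cover three integers and others only two, so the bar heights would be uneven even for perfectly uniform data.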
Explicit bins should be given, for example with bin boundaries at the halves between the integers:
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns

# Simulated discrete data: integers 0..20 drawn with random probabilities
p = np.random.rand(21) + 0.1
p /= p.sum()
raw_i = np.random.choice(range(21), size=1000000, p=p)

# Bin edges at -0.5, 0.5, ..., max + 0.5, so each integer sits in the middle of its own bin
bins = np.arange(-0.5, raw_i.max() + 1, 1)

fig, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.distplot(raw_i, bins=bins, ax=ax_hist, kde=False)
sns.boxplot(x=raw_i, ax=ax_box)  # pass the data as x so the box is drawn horizontally
ax_box.set(xlabel='')
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
ax_box.set_yticks([])
plt.show()
Upvotes: 2