user4933
user4933

Reputation: 1585

Plot density histogram of Bernoulli sample and a Bernoulli pmf together

Summary of Question:

Why is my density from my sample so different to the pmf and how can I perform this simulation so that the pmf and the sample estimates are similar.

Question:

I have simulated a sample of independent Bernoulli trials using scipy. I am now trying to take a density histogram of the sample that I created and compare it to the pmf (probability mass function). I would expect the density histogram to show two bins each hovering near the pmf but instead, I have 2 bins above the pmf values at 5. Could someone please show me how to create a density histogram that does not do this for the Bernoulli? I tried a similar simulation with a few other distributions and it seemed to work fine. What am I missing here and could you show me how to manipulate my code to make this work?

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

trials = 10**3
p = 0.5


sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate benoulli RV
plt.plot((0,1), stats.bernoulli.pmf((0,1), p), 'bo', ms=8, label='bernoulli pmf')

# Density histogram of generated values
plt.hist(sample_bernoulli, density=True, alpha=0.5, color='steelblue', edgecolor='none')
plt.show()

enter image description here

I must apologize if this is a simple or trivial question but I couldn't find a solution online and found the issue interesting. Any help at all would be appreciated.

Upvotes: 0

Views: 1164

Answers (1)

JohanC
JohanC

Reputation: 80459

The reason is that plt.hist is primarily meant to work with continuous distributions. If you don't provide explicit bin boundaries, plt.hist just creates 10 equally spaced bins between the minimum and maximum value. Most of these bins will be empty. With only two possible data values, there should be just two bins, so 3 boundaries:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

trials = 10**3
p = 0.5

sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate benoulli RV
plt.plot((0,1), stats.bernoulli.pmf((0,1), p), 'bo', ms=8, label='bernoulli pmf')

# Density histogram of generated values
plt.hist(sample_bernoulli, density=True, alpha=0.5, color='steelblue', edgecolor='none', bins=np.linspace(-0.5, 1.5, 3))
plt.show()

example plot

Here is a visualization of the default bin boundaries and how the samples fit into the bins. Note that with density=True, the histogram is normalized such that the area of all the bars sums to 1. In this case two bars are 0.1 wide and about 5.0 high, while 8 others have height zero. So, the total area is 2*0.1*5 + 8*0.0 = 1.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

trials = 10 ** 3
p = 0.5

sample_bernoulli = stats.bernoulli.rvs(p, size=trials)  # Generate benoulli RV

# Density histogram of generated values with default bins
values, binbounds, bars = plt.hist(sample_bernoulli, density=True, alpha=0.2, color='steelblue', edgecolor='none')
# show the bin boundaries
plt.vlines(binbounds, 0, max(values) * 1.05, color='crimson', ls=':')
# show the sample values with a random displacement
plt.scatter(sample_bernoulli * 0.9 + np.random.uniform(0, 0.1, trials),
            np.random.uniform(0, max(values), trials), color='lime')
# show the index of each bin
for i in range(len(binbounds) - 1):
    plt.text((binbounds[i] + binbounds[i + 1]) / 2, max(values) / 2, i, ha='center', va='center', fontsize=20, color='crimson')
plt.show()

showing bin boundaries

Upvotes: 1

Related Questions