blue-sky

Reputation: 53816

Interpreting the y values of a pdf

In trying to understand the y values of a normal distribution plot I use this code:

%reset -f

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

data = [10,10,20,40,50,60,70,80,90,100]

# Fit a normal distribution to the data:
mu, std = norm.fit(data)

# Plot the histogram.
plt.hist(data, bins=10, density=True, alpha=0.6, color='g')

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

to generate this plot:

[Image: histogram of the data with the fitted normal pdf curve overlaid]

The data is the age in years of people in a group: [10,10,20,40,50,60,70,80,90,100]

How should the y values of the generated pdf plot be interpreted? For instance, how should the bar with a y value of approximately 0.027 be interpreted?

I've read various posts such as :

https://stats.stackexchange.com/questions/332984/interpreting-a-pdf-plot

However, I can't find information that explains how to interpret the y-axis values of the plot.

Is 0.027 the probability that the age is in the range from 0 to approximately 20?

Upvotes: 0

Views: 488

Answers (1)

Ewran

Reputation: 328

A y value of the pdf curve is a probability density, not a probability. Probabilities come from areas: the area under the pdf curve between two ages x_0 and x_1 is the probability P(x_0 <= X <= x_1) that a point sampled from X falls in the interval [x_0, x_1], where X is the (normal) random variable fitted to your dataset.
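As a rough sketch of this, using the data and the `norm.fit` call from the question, the probability of an age between 0 and 20 under the fitted model can be computed as a difference of cdf values and checked against a numerical integral of the pdf. The endpoints `x0` and `x1` are picked here only for illustration.

```python
from scipy.integrate import quad
from scipy.stats import norm

data = [10, 10, 20, 40, 50, 60, 70, 80, 90, 100]
mu, std = norm.fit(data)

# Illustrative interval (an assumption, not part of the original question code).
x0, x1 = 0, 20

# Probability as a difference of cumulative distribution values.
prob_cdf = norm.cdf(x1, mu, std) - norm.cdf(x0, mu, std)

# The same probability as the area under the pdf curve between x0 and x1.
prob_area, _ = quad(norm.pdf, x0, x1, args=(mu, std))

# Height of the curve at a single point: a density, not a probability.
density_at_10 = norm.pdf(10, mu, std)

print(f"P({x0} <= X <= {x1}) via cdf : {prob_cdf:.4f}")
print(f"P({x0} <= X <= {x1}) via area: {prob_area:.4f}")
print(f"pdf height at x = 10       : {density_at_10:.4f}")
```

The two probability values should agree up to integration error, while the pdf height at a single point is a different kind of number: it only becomes a probability once multiplied by (integrated over) a width on the x axis.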

For the histogram, each bar represents an interval (a bin). With density=True, the height of a bar is the number of samples in that bin divided by (total number of samples × bin width), so that the total area of the bars equals 1. As with the pdf curve, the area of a bar estimates the probability that a random sample falls in the interval defined by that bin.
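To make the normalization concrete, here is a small sketch that recomputes the bar heights with `np.histogram`, which is assumed here to mirror what `plt.hist(..., bins=10, density=True)` draws for the question's data. Multiplying each height by its bin width recovers the fraction of samples in that bin.

```python
import numpy as np

data = [10, 10, 20, 40, 50, 60, 70, 80, 90, 100]

# Raw counts per bin and the density-normalized heights for the same bins.
counts, edges = np.histogram(data, bins=10)
heights, _ = np.histogram(data, bins=10, density=True)

widths = np.diff(edges)      # width of each bin on the x axis
areas = heights * widths     # area of each bar

for left, right, h, a in zip(edges[:-1], edges[1:], heights, areas):
    print(f"[{left:5.1f}, {right:5.1f}): height = {h:.4f}, area = {a:.2f}")

print("total area:", areas.sum())
```

Each bar's area equals `counts / len(data)`, so the area of the bar, not its height, is the estimated probability of landing in that bin.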

If a normal distribution is indeed a good choice to model your random variable, one would then expect the histogram and the fitted pdf to get closer and closer as you add points to your dataset (for a well chosen number of bins).
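A quick way to see this convergence is to simulate it. The sketch below assumes the data really are normal by drawing samples from a normal distribution with mean 53 and standard deviation 31 (chosen to match the fit on the question's data), then measures the average gap between the histogram heights and the pdf for a small and a large sample. The helper `mean_abs_gap`, the bin count, and the sample sizes are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed "true" parameters, for illustration only.
mu, std = 53.0, 31.0

def mean_abs_gap(n_samples, bins=20):
    """Average absolute gap between histogram heights and the pdf at bin centers."""
    samples = rng.normal(mu, std, size=n_samples)
    heights, edges = np.histogram(
        samples, bins=bins, range=(mu - 3 * std, mu + 3 * std), density=True
    )
    centers = (edges[:-1] + edges[1:]) / 2
    return np.mean(np.abs(heights - norm.pdf(centers, mu, std)))

gap_small = mean_abs_gap(50)
gap_large = mean_abs_gap(200_000)

print(f"mean gap with      50 samples: {gap_small:.6f}")
print(f"mean gap with 200,000 samples: {gap_large:.6f}")
```

With only ten data points, as in the question, the histogram is expected to look rough compared with the fitted curve; the gap shrinks as the sample grows, provided the normal model is appropriate.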

Upvotes: 1
