Viktor

Reputation: 333

Histogram gives strange values for the probability density function

I am trying to build a histogram with the probability density function on the y-axis. In all the examples I have seen, this function lies in the range from 0 to 1, which makes sense to me.

But my code produces y-values from 0 to 8000, and I don't understand why. The code is below. At the final stage, the axerr array looks like this:

 [1.57020e-04 1.97490e-04 5.03800e-04 2.24770e-04 1.14830e-04 1.08260e-04
 2.18140e-04 1.21900e-04 1.74830e-04 1.93780e-04 1.71790e-04 1.77780e-04
 2.20330e-04 7.95300e-05 1.14852e-03 1.79160e-04 1.75580e-04 2.78850e-04
 1.69800e-04 2.47200e-04 1.65380e-04 1.88840e-04 1.21360e-04 2.36600e-04
 1.64360e-04 1.97670e-04 1.72710e-04 1.44440e-04 2.31840e-04 9.81200e-05
 7.15160e-04 1.65960e-04 2.67680e-04 1.85360e-04 1.88800e-04 1.88370e-04
 1.52610e-04 1.90090e-04 1.46900e-04 1.72760e-04 1.50750e-04 1.44710e-04
 1.89070e-04 1.69380e-04 1.48960e-04 1.68550e-04 3.64510e-04 3.70100e-04
 1.43380e-04 1.03310e-04 1.92930e-04 2.02960e-04 2.19060e-04 2.20950e-04
 1.19170e-04 1.36040e-04 2.61100e-04 2.19740e-04 2.54570e-04 9.49600e-05
 1.84260e-04 1.62430e-04 2.20980e-04 1.61800e-04 1.84360e-04 1.42410e-04
 1.65170e-04 1.50550e-04 2.65350e-04 2.12590e-04 1.15280e-04 1.03920e-04
 1.64550e-04 1.76450e-04 6.51310e-04 1.75970e-04 1.49710e-04 1.37470e-04
 3.68000e-04 2.71530e-04 1.37340e-04 1.16980e-04 1.36640e-04 1.76450e-04
 3.06170e-04 1.93390e-04 1.57760e-04 2.41060e-04 1.57280e-04 6.49310e-04
 1.35760e-04 1.16790e-04 1.44440e-04 1.53720e-04 1.28480e-04 1.83890e-04
 8.38500e-05 2.57420e-04 1.77980e-04 2.44480e-04 1.19400e-04 1.67780e-04
 1.71860e-04 1.67000e-04 1.53590e-04 9.11300e-05 2.09940e-04 1.41630e-04
 1.06670e-04 1.44750e-04 1.21140e-04 1.14270e-04 1.29120e-04 1.26637e-03
 1.44300e-04 1.02310e-04 1.53900e-04 1.48930e-04 1.92910e-04 2.37970e-04
 1.87570e-04 1.77940e-04 2.56090e-04 1.97750e-04 1.83930e-04 2.01870e-04
 1.27830e-04 8.03000e-05 1.50350e-04 3.85020e-04 1.85530e-04 1.68040e-04
 2.71320e-04 1.50470e-04 3.41840e-04 2.07820e-04 2.09820e-04 1.27700e-04
 1.58620e-04 1.57040e-04 1.50540e-04 1.15330e-04 1.37910e-04 1.96580e-04
 1.94320e-04 1.09880e-04 1.64360e-04 8.10400e-05 1.69810e-04 1.23360e-04
 2.33720e-04 2.47400e-04 1.33170e-04 2.00550e-04 2.47920e-04 1.45160e-04
 1.59030e-04 2.39060e-04 1.70110e-04 1.29450e-04 5.14930e-04 1.76020e-04
 1.10990e-04 2.02560e-04 8.50800e-05 2.12330e-04 1.48860e-04 9.51500e-05
 1.29200e-04 1.27250e-04 1.40320e-04 2.27170e-04 1.69850e-04 1.73830e-04
 2.19320e-04 2.47860e-04 2.93890e-04 2.66180e-04 1.58140e-04 1.52950e-04
 4.14790e-04 1.18380e-04 1.88540e-04 1.90790e-04 1.46800e-04 3.27730e-04]

The resulting histogram is shown in this figure: https://drive.google.com/open?id=1CnL5gzuZivAFAcJQPOT4l6SqJvimjuao

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


axc, axerr = [], []

# Keep only rows with a nonzero error below 0.01
with open("primary.txt") as f:
    for line in f:
        xc, xerr = line.split()
        xc = float(xc)
        xerr = float(xerr)
        if xerr != 0 and xerr < 0.01:
            axc.append(xc)
            axerr.append(xerr)

axc = np.array(axc)
axerr = np.array(axerr)

plt.hist(axerr, bins='auto', density=True)

#kernel = stats.gaussian_kde(axerr)

#x2 = np.linspace(np.min(axerr), np.max(axerr), 300)
#plt.plot(x2, kernel(x2), "b-")

plt.xlabel('Error')
plt.ylabel('Probability')
plt.savefig("stat.png")

Upvotes: 1

Views: 311

Answers (1)

Sheldore

Reputation: 39072

The values and the plot are not strange; they are correct. The reason is as follows: density=True normalizes the distribution so that the area under the curve is 1. For a histogram, this means that the areas of the bars sum to 1. A probability density is not itself a probability, so it is not restricted to the range from 0 to 1.

Since your x-values are on the order of 10^(-4) to 10^(-3), each bin is very narrow, so the bar heights must be on the order of 10^3 to 10^4 for the areas (height times bin width) to add up to 1. If you compute the total area of the bars in your histogram, you will find that it is indeed 1, which is exactly what density=True guarantees.
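A quick way to convince yourself is to compute the bar areas directly. The sketch below uses synthetic data as a stand-in for your axerr array (the uniform range 1e-4 to 1e-3 and the name errors are assumptions for illustration, not your actual data) and calls np.histogram, which applies the same density normalization as plt.hist:

```python
import numpy as np

# Synthetic stand-in for axerr (assumed range, not the asker's real data)
rng = np.random.default_rng(0)
errors = rng.uniform(1e-4, 1e-3, size=500)

# Same normalization that plt.hist(..., density=True) applies
heights, edges = np.histogram(errors, bins='auto', density=True)
widths = np.diff(edges)

# Total area of the bars: sum of height * width over all bins
area = np.sum(heights * widths)

print("largest bar height:", heights.max())
print("total area:", area)
```

The bar heights come out far above 1 because the bin widths are tiny, while the total area stays at 1 up to floating-point rounding.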

From the docs:

density : bool, optional

If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations.
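The formula in the docs can be reproduced by hand. This is a minimal sketch, again on synthetic data (the array errors and the choice of 20 bins are assumptions for illustration), that compares the manual calculation with density=True and also shows the per-bin fractions, which do stay between 0 and 1:

```python
import numpy as np

# Synthetic stand-in for axerr (assumed range, not the asker's real data)
rng = np.random.default_rng(1)
errors = rng.uniform(1e-4, 1e-3, size=500)

# Raw counts per bin and the bin widths
counts, edges = np.histogram(errors, bins=20)
widths = np.diff(edges)

# Density as described in the docs: count / (number of observations * bin width)
manual_density = counts / (counts.sum() * widths)
numpy_density, _ = np.histogram(errors, bins=20, density=True)
print("matches density=True:", np.allclose(manual_density, numpy_density))

# Fraction of observations per bin: count / number of observations
fractions = counts / counts.sum()
print("fractions sum to:", fractions.sum())
```

If what you want on the y-axis is the fraction of observations per bin rather than a density, one option is to drop density=True and pass weights=np.ones_like(axerr) / len(axerr) to plt.hist, so that the bar heights (not the areas) sum to 1.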

Upvotes: 2
