Reputation: 63
The following methods are supposed to compute a PDF:
bins = [20 23 31.5 57 62.5 89 130]; % classes of values of my random variable
mean = 23;
std = mean/2;
values = mean + std*randn(1000,1); % the actual values of the RV
% method 1
[num, bins] = hist(values, bins); % histogram on the previously defined bins
pdf2 = num/trapz(bins, num);
trapz(bins, pdf2) % checking the integral under the curve, which is indeed 1
ans =
1
% method 2
pdf1 = normpdf(bins, mean, std); % the Matlab command for creating o normal PDF
trapz(bins, pdf1) % this is NOT equal to 1
ans =
0.7069
However if I consider the bins something like
bins = [0:46];
the results are
ans =
1
ans =
0.9544
so I still have not the value 1 for the integral in the case of normpdf
.
Why does the normpdf
not give the integral equal to 1 for the PDFs? Is there something I'm missing in the codes above?
Upvotes: 1
Views: 997
Reputation: 37626
The problem is that you are missing a lot of values from your PDF, if you take bins = [0:46]
, you have the following curve:
Which means you are missing all the part x < 0
and x > 46
, so the integral you are computing is not from -oo
to +oo
as you expect but from 0
to 46
, obvisouly you won't get the correct answer.
Note that you have mean = 23
and std = 11.5
, thus if you have bins = 0:46
you have a band around the mean with a width of one std
on each side, so according to the 68–95–99.7 rule, 95% of the values lie in this band, which is consistent with the 0.9544 you get.
If you take bins = -11.5:57.5
, you now have three standard deviations on each side, and you will get 99.7%
of the values in this band (MATLAB gives me 0.9973
, the filled area are the ones you did not have with bins = 0:46
):
Note that if you want to reach 1.000
with an error better than 10-3, you need about 3.4 standard deviations1:
>> bins = (mean - 3.4 * std):(mean + 3.4 * std) ;
>> pdf = normpdf(bins, mean, std) ;
>> trapz(bins, pdf)
0.9993
Note that with bins = [20 23 31.5 57 62.5 89 130];
, you have both a problem of precision and a problem of missing values (your curve is the blue one, the red one was generated using bins = 20:130
):
Obvisouly, if you compute the area under the blue curve, you won't get the value of the area under the red one, and you won't certainly get a value close to the integral of the red curve between -oo
and +oo
.
Upvotes: 5