Safi Khan

Reputation: 33

How to effectively compute the pdf of a given dataset

I tried to compute the probability distribution function of my iris dataset for the petal lengths of setosa flowers using numpy.histogram.

I wanted to plot the probability distribution function for the petal length of the setosa flowers. Unfortunately, I got confused about what np.histogram actually returns. In the code below, going on my vague knowledge, I set bins to 10 and density to True.

Could anyone provide some insight into what the code below does, and what a PDF essentially is? Also, is there a better way to compute the PDF for this dataset?

import pandas as pd
import numpy as np

iris = pd.read_csv('iris.csv')
iris_setosa = iris[iris.species == 'setosa']

counts, bin_edges = np.histogram(iris_setosa["petal_length"], bins=10, density=True)

pdf = counts / sum(counts)

Upvotes: 3

Views: 11065

Answers (3)

Karthik Vijayasarathi

Reputation: 66

Let me put it this way:

When you run the line below and print out the counts and bin_edges variables,

counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10, density=True)

The result will be

counts --> [0.22222222 0.22222222 0.44444444 1.55555556 2.66666667 3.11111111 1.55555556 0.88888889 0. 0.44444444]

bin_edges --> [1. 1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]

Data source: the Iris data set. Library: NumPy.

What the above code does behind the scenes is the following:

1. First, based on the number of bins and the minimum and maximum values in the setosa petal length data, it calculates a bin width and creates a histogram where the X axis is petal length and the Y axis is the number of flowers. You can see this if you just remove the density parameter from the above code.

counts_number, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10)

This results in counts_number --> [ 1 1 2 7 12 14 7 4 0 2], which means there is just 1 flower in the bin [1, 1.09).

2. Next, it calculates the relative frequency for each bin, i.e. it divides counts_number by the total number of flowers (here 50, from the data set). You can see this as follows:

rel_freq = counts_number / 50
print(rel_freq)

This results in --> [0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0.   0.04]

These are relative frequencies and can also be interpreted as probability values. This interpretation rests on the law of large numbers.

3. The Y values in any PDF are not actual probabilities but probability densities. So if you divide rel_freq by the bin width, you get

--> [0.22222222 0.22222222 0.44444444 1.55555556 2.66666667 3.11111111 1.55555556 0.88888889 0. 0.44444444]

As you can see, this is the same as what we got just by using the density=True parameter.
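You can check this equivalence yourself with a minimal sketch. It uses a synthetic 50-point sample in place of iris.csv (so the numbers will differ from the ones above), but the relationship density = relative frequency / bin width holds either way:

```python
import numpy as np

# Synthetic stand-in for the 50 setosa petal lengths (iris.csv not assumed present)
rng = np.random.default_rng(0)
data = rng.normal(loc=1.46, scale=0.17, size=50)

counts_number, bin_edges = np.histogram(data, bins=10)    # raw counts per bin
density, _ = np.histogram(data, bins=10, density=True)    # densities per bin

bin_width = bin_edges[1] - bin_edges[0]                   # bins=10 gives equal widths
rel_freq = counts_number / counts_number.sum()            # per-bin probabilities

# density is exactly the relative frequency divided by the bin width
print(np.allclose(rel_freq / bin_width, density))  # True
```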

You have not provided the complete code, so I can't tell what you are doing after calculating the pdf variable. Let me make my assumptions and explain further.

The Y axis values in a PDF can be greater than 1, as they are densities and not probabilities. The line in your program

pdf=counts/sum(counts)

normalizes the pdf numpy array. Put more plainly, that line does the same thing as multiplying the counts array by the bin width, i.e. it recalculates the relative frequencies (a.k.a. probabilities) from the densities. So, if you run the line below

print(counts * 0.09)  # 0.09 is the bin width for 10 bins

it gives --> [0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0.   0.04]

This is exactly the same as the pdf variable.

Now you can use this pdf array to calculate the CDF, since the CDF is the cumulative sum of the probabilities over the bins. Using the density values in counts directly to calculate the CDF would not make sense.
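Here is a minimal sketch of that CDF calculation, again with a synthetic sample standing in for the setosa petal lengths:

```python
import numpy as np

# Synthetic stand-in for the 50 setosa petal lengths
rng = np.random.default_rng(0)
data = rng.normal(loc=1.46, scale=0.17, size=50)

counts, bin_edges = np.histogram(data, bins=10, density=True)

pdf = counts * np.diff(bin_edges)   # density * bin width = probability per bin
cdf = np.cumsum(pdf)                # running total of those probabilities

# every observation falls in some bin, so the CDF ends at 1
print(np.isclose(cdf[-1], 1.0))  # True
```

Plotting bin_edges[1:] against cdf then gives the empirical CDF curve.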

Now we can plot the two PDFs with the lines below (just example code; make sure you import the relevant plotting libraries):

import matplotlib.pyplot as plt

plt.plot(bin_edges[1:], pdf, label="normalised_pdf")
plt.plot(bin_edges[1:], counts, label="actual_pdf")
plt.legend()

This would result in

Resulting graph

You can see in the graph that they are just scaled versions of each other.

Upvotes: 5

ljagodz

Reputation: 34

Since you set density=True, it is most correct to say that what is being computed here is the probability density function. The term probability distribution function is kind of ambiguous, since there are a number of ways to quantify the probability distribution of data.

I will provide a link to the Wikipedia page for the probability density function, but essentially its integral over a given range gives you the probability of landing in that range.

Probability Density Function: https://en.wikipedia.org/wiki/Probability_density_function

So if I understand correctly, in this line:

pdf=counts/sum(counts)

You were trying to normalize the values of counts. Note that density=True has already normalized the result so that the histogram integrates to 1; dividing by sum(counts) rescales those densities into per-bin probabilities, which is a different normalization.

I do not know if there is a better way to compute a PDF in this instance, but from what I can tell, increasing the number of bins can give you a better approximation of the PDF, provided you have enough data.
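As a rough illustration of that bin-count trade-off, here is a sketch using a large synthetic standard-normal sample (the setosa subset has only 50 rows, where very fine bins would just get noisy), comparing the histogram density against the known true density:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # large synthetic sample

def true_pdf(t):
    # standard normal density, for comparison
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

for bins in (10, 100):
    density, edges = np.histogram(x, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    # mean absolute gap between the histogram estimate and the true density
    err = np.mean(np.abs(density - true_pdf(centers)))
    print(f"{bins} bins: mean abs error {err:.4f}")
```

With plenty of data the finer histogram tracks the underlying density more closely; with only 50 points, fewer bins are usually the safer choice.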

numpy.histogram: https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html

Upvotes: 0

Tarifazo

Reputation: 4343

You can use the np.histogram function to create a histogram from sample data and the scipy.stats.rv_histogram function to work with it. See the docs for rv_histogram here for an illustration.

The rv_histogram stores the parameters of your distribution, and, among other things, can be used to calculate pdf or cdf:

from scipy.stats import rv_histogram
import numpy as np

x = np.random.random(10000)
r = rv_histogram(np.histogram(x, bins=100))

r.pdf(np.linspace(0,1,5))  # 0, 0.25, 0.5, 0.75, 1
>> array([0.        , 0.96009784, 1.05010702, 0.97009886, 0.        ])

r.cdf(np.linspace(0,1,5))
>> array([0.        , 0.2554366 , 0.50824724, 0.75229438, 1.        ])

Upvotes: 3
