Reputation: 33
I am implementing Gaussian Naive Bayes Algorithm:
# importing modules
import pandas as pd
import numpy as np
# create an empty dataframe
data = pd.DataFrame()
# create our target variable
data["gender"] = ["male","male","male","male",
"female","female","female","female"]
# create our feature variables
data["height"] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data["weight"] = [180,190,170,165,100,150,130,150]
data["foot_size"] = [12,11,12,10,6,8,7,9]
# view the data
print(data)
# create an empty dataframe
person = pd.DataFrame()
# create some feature values for this single row
person["height"] = [6]
person["weight"] = [130]
person["foot_size"] = [8]
# view the data
print(person)
# Priors can be calculated either constants or probability distributions.
# In our example, this is simply the probability of being a gender.
# calculating prior now
# number of males
n_male = data["gender"][data["gender"] == "male"].count()
# number of females
n_female = data["gender"][data["gender"] == "female"].count()
# total people
total_ppl = data["gender"].count()
print ("Male count =",n_male,"and Female count =",n_female)
print ("Total number of persons =",total_ppl)
# number of males divided by the total rows
p_male = n_male / total_ppl
# number of females divided by the total rows
p_female = n_female / total_ppl
print ("Probability of MALE =",p_male,"and FEMALE =",p_female)
# group the data by gender and calculate the means of each feature
data_means = data.groupby("gender").mean()
# view the values
data_means
# group the data by gender and calculate the variance of each feature
data_variance = data.groupby("gender").var()
# view the values
data_variance
data_variance = data.groupby("gender").var()
data_variance["foot_size"][data_variance.index == "male"].values[0]
# means for male
male_height_mean=data_means["height"][data_means.index=="male"].values[0]
male_weight_mean=data_means["weight"][data_means.index=="male"].values[0]
male_footsize_mean=data_means["foot_size"][data_means.index=="male"].values[0]
print (male_height_mean,male_weight_mean,male_footsize_mean)
# means for female
female_height_mean=data_means["height"][data_means.index=="female"].values[0]
female_weight_mean=data_means["weight"][data_means.index=="female"].values[0]
female_footsize_mean=data_means["foot_size"][data_means.index=="female"].values[0]
print (female_height_mean,female_weight_mean,female_footsize_mean)
# variance for male
male_height_var=data_variance["height"][data_variance.index=="male"].values[0]
male_weight_var=data_variance["weight"][data_variance.index=="male"].values[0]
male_footsize_var=data_variance["foot_size"][data_variance.index=="male"].values[0]
print (male_height_var,male_weight_var,male_footsize_var)
# variance for female
female_height_var=data_variance["height"][data_variance.index=="female"].values[0]
female_weight_var=data_variance["weight"][data_variance.index=="female"].values[0]
female_footsize_var=data_variance["foot_size"][data_variance.index=="female"].values[0]
print (female_height_var,female_weight_var,female_footsize_var)
# create a function that calculates p(x | y):
def p_x_given_y(x,mean_y,variance_y):
# input the arguments into a probability density function
p = 1 / (np.sqrt(2 * np.pi * variance_y)) * \
np.exp((-(x - mean_y) ** 2) / (2 * variance_y))
# return p
return p
# numerator of the posterior if the unclassified observation is a male
posterior_numerator_male = p_male * \
p_x_given_y(person["height"][0],male_height_mean,male_height_var) * \
p_x_given_y(person["weight"][0],male_weight_mean,male_weight_var) * \
p_x_given_y(person["foot_size"][0],male_footsize_mean,male_footsize_var)
# numerator of the posterior if the unclassified observation is a female
posterior_numerator_female = p_female * \
p_x_given_y(person["height"][0],female_height_mean,female_height_var) * \
p_x_given_y(person["weight"][0],female_weight_mean,female_weight_var) * \
p_x_given_y(person["foot_size"][0],female_footsize_mean,female_footsize_var)
print ("Numerator of Posterior MALE =",posterior_numerator_male)
print ("Numerator of Posterior FEMALE =",posterior_numerator_female)
if (posterior_numerator_male >= posterior_numerator_female):
print ("Predicted gender is MALE")
else:
print ("Predicted gender is FEMALE")
When we are calculating the probability, we are calculating it using the Gaussian PDF:
$$ P(x) = \frac{1}{\sqrt {2 \pi {\sigma}^2}} e^{\frac{-(x- \mu)^2}{2 {\sigma}^2}} $$
My question is that the above equation is that of a PDF. To calculate probability, we have to integrate it over an area dx.
$ \int_{x0}^{x1} P(x)dx $
But in the above program, we are plugging the value of x and calculating the probability. Is that correct? Why? I have seen most of the articles calculating the probability ib the same manner.
If this is the wrong way to calculate the probability in the Naive Bayes Classifier, then what is the correct method?
Upvotes: 3
Views: 861
Reputation: 14858
The method is correct. The pdf
function is a probability density, i.e., a function that measures the probability of being in a neighborhood of a value divided by the "size" of such a neighborhood, where the "size" is the length in dimension 1, the area in 2, the volume in 3, etc.
In continuous probabilities the probability of getting precisely any given outcome is 0, and this is why densities are used instead. Therefore, we don't deal with expressions such as P(X=x)
but with P(|X-x| < Δ(x))
, which stands for the probability of X
being close x
.
Let me simplify the notation and write P(X~x)
for P(|X-x| < Δ(x))
.
If you apply the Bayes rule here, you will get
P(X~x|W~w) = P(W~w|X~x)*P(X~x)/P(W~w)
because we are dealing with probabilities. If we now introduce densities:
pdf(x|w)*Δ(x) = pdf(w|x)Δ(w)*pdf(x)Δ(x)/(pdf(w)*Δ(w))
because probability = density*neighborhood_size
. And since all Δ(·)
cancel out in the expression above, we get
pdf(x|w) = pdf(w|x)*pdf(x)/pdf(w)
which is the Bayes rule for densities.
The conclusion is that, given that the Bayes rule also holds for densities, it is legitimate to use the same methods replacing probabilities with densities when dealing with continuous random variables.
Upvotes: 3