I have a multivariate dataset (4 dimensions) of 150 points, and I need to use the EM algorithm to group these points into clusters. The number of clusters the points fall into is given as 3, so the task is to find the Gaussians of the three clusters.
I have used the EM algorithm for clustering of Gaussian mixtures given in the book "Bishop - Pattern Recognition and Machine Learning (2006)". The algorithm is attached in the image.
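For reference, the E-step of that algorithm computes, for every point, the responsibility of each mixture component: the component's weighted Gaussian density, normalized over all components. A minimal sketch of just that step (a standalone illustration with hypothetical names, not my actual code):

import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, means, covariances, weights):
    # Responsibility gamma_k = w_k * N(x | mu_k, Sigma_k) / sum_j w_j * N(x | mu_j, Sigma_j)
    densities = np.array([w * multivariate_normal.pdf(x, m, c, allow_singular=True)
                          for w, m, c in zip(weights, means, covariances)])
    return densities / densities.sum()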
The code that I have written is as follows:
import numpy as np
import math
import pandas as pd
from scipy.stats import multivariate_normal
from decimal import Decimal

def clustering(points):
    points = list(set(points))  # Eliminate redundant points

    # Initialize random Gaussians
    mean = [np.array([5, 4, 1.5, 0.4]), np.array([4.5, 5, 0.5, 0.5]), np.array([6, 3, 2, 1])]
    covariance1 = np.array([[16, 0, 0, 0], [0, 4, 0, 0], [0, 0, 9, 0], [0, 0, 0, 1]])
    covariance2 = np.array([[4, 0, 0, 0], [0, 16, 0, 0], [0, 0, 1, 0], [0, 0, 0, 9]])
    covariance3 = np.array([[16, 0, 0, 0], [0, 1, 0, 0], [0, 0, 9, 0], [0, 0, 0, 4]])
    covariance = [covariance1, covariance2, covariance3]
    weights = np.array([0.3, 0.3, 0.4])
    init_log_likelihood = None  # Log likelihood to test convergence, initially None
    threshold = 0.0000000000000000000001  # Threshold difference for convergence
    iterations = 0  # Iteration counter
    while True:
        iterations += 1
        prob_dict = {point: [] for point in points}
        new_means = [np.zeros(4) for _ in range(3)]
        new_covariance = [np.zeros((4, 4)) for _ in range(3)]
        log_likelihood = Decimal(0)  # Using Decimal to increase the precision of floating-point numbers
        for point in points:
            prob_sum = 0  # Probability sum for normalization
            for k in range(3):
                prob = multivariate_normal.pdf(point, mean[k], covariance[k], allow_singular=True)  # Gaussian density
                prob = np.dot(weights[k], prob) + 1  # Multiply the Gaussian density by the weight
                prob_dict[point].append(prob)
                prob_sum += prob
            prob_sum = Decimal(prob_sum)
            log_likelihood += prob_sum.ln()  # Accumulate the log likelihood
            for k in range(3):
                prob_dict[point][k] = prob_dict[point][k] / float(prob_sum)  # Normalize the probability
                new_means[k] = new_means[k] + np.dot(prob_dict[point][k], list(point))  # New means for the next iteration
        for k in range(3):
            weights[k] = sum(probs[k] for probs in prob_dict.values())  # New weights
            new_means[k] = np.divide(new_means[k], weights[k])
            for point in points:
                # New covariance
                diff = np.array(point) - new_means[k]
                new_covariance[k] = new_covariance[k] + np.dot(np.dot(prob_dict[point][k], diff), np.transpose(diff))
        for k in range(3):
            new_covariance[k] = np.divide(new_covariance[k], weights[k])
            weights[k] = weights[k] / len(points)
        mean = new_means
        covariance = new_covariance
        # Check for convergence
        if init_log_likelihood and abs(init_log_likelihood - log_likelihood) <= threshold:
            break
        init_log_likelihood = log_likelihood
    print("Means: ", mean)
    print("Covariance: ", covariance)
    print("Log-likelihood: ", log_likelihood)
    print("Iterations: ", iterations)
After a certain number of iterations I always observe that the three Gaussians converge into one, i.e. the parameters (mean, covariance matrix) of all three Gaussians become the same. But I should get three different Gaussians, since the given set of points falls into three clusters.
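To have a reference answer to compare against, a fit can also be obtained with scikit-learn's GaussianMixture (a minimal sketch; the random X below is only a placeholder standing in for my actual 150x4 array of points):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))  # placeholder: stands in for the real 150x4 dataset

gm = GaussianMixture(n_components=3, covariance_type="full").fit(X)
print("Weights:", gm.weights_)
print("Means:", gm.means_)
print("Covariances:", gm.covariances_)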
Please help me figure out where I am going wrong.