I have a dataset with categorical features that I want to cluster using a soft clustering approach, where each data point can belong to multiple clusters with different probabilities.
My questions are: is this a valid way to obtain soft cluster memberships for categorical data, and if not, what would be the right approach?
As a first attempt, I tried combining the Categorical Naive Bayes classifier with the Expectation Maximization (EM) algorithm. Here's the code I used:
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import CategoricalNB

def EM_CategoricalNB(X, k, max_iters=100, tol=1e-4, verbose=False):
    """
    EM algorithm using CategoricalNB for clustering categorical data.

    Parameters:
        X (np.array): Data array; each row is a data point and values are
            integer-encoded categories.
        k (int): Number of clusters.
        max_iters (int): Maximum number of EM iterations.
        tol (float): Tolerance for the convergence check on the log likelihood.
        verbose (bool): Whether to print status messages.

    Returns:
        model (CategoricalNB): Trained Naive Bayes model.
        assignments (np.array): Cluster assignments of the data points.
    """
    # Initialize the model by fitting it on random cluster assignments
    model = CategoricalNB()
    assignments = np.random.randint(0, k, size=len(X))
    model.fit(X, assignments)

    previous_log_likelihood = -np.inf
    for iteration in range(max_iters):
        # E-step: estimate cluster membership probabilities
        probabilities = model.predict_proba(X)

        # M-step: re-fit the model on the most likely cluster per point.
        # The argmax makes this a *hard* assignment ("hard EM" / CEM); note that
        # a cluster that loses all its points silently drops out of model.classes_.
        model.fit(X, np.argmax(probabilities, axis=1))

        # Data log likelihood: log-sum over clusters of the joint log P(x, c).
        # model.score() would return classification accuracy, not a likelihood.
        # (predict_joint_log_proba needs scikit-learn >= 1.2.)
        log_likelihood = logsumexp(model.predict_joint_log_proba(X), axis=1).sum()
        if verbose:
            print(f"Iteration {iteration}: Log Likelihood = {log_likelihood}")

        # Check for convergence
        if np.abs(log_likelihood - previous_log_likelihood) < tol:
            if verbose:
                print("Convergence reached.")
            break
        previous_log_likelihood = log_likelihood

    # Final hard assignments
    final_assignments = model.predict(X)
    return model, final_assignments
The idea is to initialize the Categorical Naive Bayes model with random cluster assignments, and then iteratively:

1. E-step: compute each point's cluster membership probabilities with predict_proba.
2. M-step: re-fit the model on the most likely cluster for each point (the argmax), repeating until the data log likelihood stops improving.

The soft memberships I'm ultimately after would then come from predict_proba, even though the M-step itself hardens them.
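For reference, this is how I call it on a small synthetic dataset (the sizes and category counts below are made up purely for the demo):

import numpy as np

# Hypothetical demo data: 200 points, 3 categorical features with 4, 3, and 5 levels
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 4, size=200),
    rng.integers(0, 3, size=200),
    rng.integers(0, 5, size=200),
])

model, hard_assignments = EM_CategoricalNB(X, k=3, verbose=True)

# The soft memberships I actually want: one probability per (point, cluster)
soft_memberships = model.predict_proba(X)
print(soft_memberships[:5])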
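Since the argmax in the M-step throws away the soft information, I also sketched what I think a fully soft EM for a mixture of categorical distributions would look like, in plain numpy rather than CategoricalNB (the weighted counts play the role of its per-class category frequencies). This is an unverified sketch, and soft_EM_categorical and its parameters are just names I made up:

import numpy as np

def soft_EM_categorical(X, k, n_categories, max_iters=100, tol=1e-4, seed=0):
    """Sketch of fully soft EM for a mixture of categorical distributions.
    n_categories: list with the number of levels of each feature."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Random initial responsibilities (soft assignments), one row per point
    resp = rng.dirichlet(np.ones(k), size=n)

    log_likelihood_old = -np.inf
    for _ in range(max_iters):
        # M-step: mixture weights and per-cluster category probabilities
        # from the *weighted* counts (no argmax anywhere)
        pi = resp.mean(axis=0)                      # shape (k,)
        theta = []                                  # theta[f][j, c] = P(x_f = c | cluster j)
        for f, n_cat in enumerate(n_categories):
            counts = np.zeros((k, n_cat))
            for c in range(n_cat):
                counts[:, c] = resp[X[:, f] == c].sum(axis=0)
            counts += 1.0                           # Laplace smoothing, like CategoricalNB's alpha=1
            theta.append(counts / counts.sum(axis=1, keepdims=True))

        # E-step: responsibilities from the current parameters
        log_joint = np.log(pi)[None, :].repeat(n, axis=0)  # log P(x_i, j), shape (n, k)
        for f in range(d):
            log_joint += np.log(theta[f][:, X[:, f]]).T
        log_norm = np.logaddexp.reduce(log_joint, axis=1)  # log P(x_i)
        resp = np.exp(log_joint - log_norm[:, None])

        log_likelihood = log_norm.sum()
        if np.abs(log_likelihood - log_likelihood_old) < tol:
            break
        log_likelihood_old = log_likelihood

    return resp, pi, theta

Here resp would be exactly the point-by-cluster membership matrix I described at the top.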