Maxolf

Reputation: 1

Categorical Naive Bayes and EM for soft clustering categorical data

I have a dataset with categorical features that I want to cluster using a soft clustering approach, where each data point can belong to multiple clusters with different probabilities.

My questions are:

  1. Is combining Categorical Naive Bayes classification with the Expectation Maximization (EM) algorithm a valid approach for soft clustering of categorical data? Are there any theoretical or practical issues with combining the two in this way?
  2. Are there better alternatives for soft clustering of purely categorical data? I've come across methods like fuzzy k-modes and latent class analysis, but I'm not sure how they compare (my rough understanding of latent class analysis is sketched right after this list).
  3. Any suggestions for improving the code or the overall approach?
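
To make question 2 concrete: as far as I understand, latent class analysis fits essentially the same model (a mixture of products of independent categorical distributions), but with a genuinely soft M-step that reweights the category counts by the posterior responsibilities instead of committing to hard labels. Here is a minimal hand-rolled sketch of what I mean; the function and variable names are my own, and I've only tried it on toy data:

import numpy as np
from scipy.special import logsumexp

def lca_em(X, k, n_iters=50, alpha=1.0, seed=0):
    """Soft EM for a latent class model (mixture of independent categoricals)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_cats = X.max(axis=0) + 1                 # number of categories per feature
    resp = rng.dirichlet(np.ones(k), size=n)   # random initial soft responsibilities
    for _ in range(n_iters):
        # M-step: mixing weights and per-feature category probabilities,
        # estimated from responsibility-weighted counts (alpha = smoothing)
        pi = resp.mean(axis=0)
        thetas = []
        for j in range(d):
            counts = np.full((k, n_cats[j]), alpha)
            for c in range(n_cats[j]):
                counts[:, c] += resp[X[:, j] == c].sum(axis=0)
            thetas.append(counts / counts.sum(axis=1, keepdims=True))
        # E-step: log p(x, z=c) = log pi_c + sum_j log theta_{j,c}[x_j]
        log_joint = np.tile(np.log(pi), (n, 1))
        for j in range(d):
            log_joint += np.log(thetas[j])[:, X[:, j]].T
        resp = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
    return pi, thetas, resp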

As a first attempt, I tried combining the Categorical Naive Bayes classifier with EM. Here's the code I used:

import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import CategoricalNB

def EM_CategoricalNB(X, k, max_iters=100, tol=1e-4, verbose=False):
    """
    Hard (classification) EM using CategoricalNB for clustering categorical data.

    Parameters:
        X (np.array): Data array; each row is a data point, and values are integer-encoded categories.
        k (int): Number of clusters.
        max_iters (int): Maximum number of EM iterations.
        tol (float): Tolerance on the change in log likelihood for the convergence check.
        verbose (bool): Whether to print status messages.

    Returns:
        model (CategoricalNB): Trained Naive Bayes model.
        assignments (np.array): Hard cluster assignments of the data points.
    """
    # Initialize the model with random hard cluster assignments
    model = CategoricalNB()
    assignments = np.random.randint(0, k, size=len(X))
    model.fit(X, assignments)

    previous_log_likelihood = -np.inf

    for iteration in range(max_iters):
        # E-step: estimate cluster membership probabilities
        probabilities = model.predict_proba(X)

        # M-step: re-fit on the most probable cluster for each point.
        # CategoricalNB.fit only accepts hard labels, so this is hard
        # (classification) EM, not a true soft M-step.
        model.fit(X, np.argmax(probabilities, axis=1))

        # Data log likelihood for the convergence check:
        # log p(x) = logsumexp_c log p(x, c).
        # (model.score would return classification accuracy, not likelihood;
        # predict_joint_log_proba requires scikit-learn >= 1.2.)
        log_likelihood = logsumexp(model.predict_joint_log_proba(X), axis=1).sum()
        if verbose:
            print(f"Iteration {iteration}: Log Likelihood = {log_likelihood}")

        # Stop when the change in log likelihood falls below tol
        if np.abs(log_likelihood - previous_log_likelihood) < tol:
            if verbose:
                print("Convergence reached.")
            break
        previous_log_likelihood = log_likelihood

    # Final hard assignments
    final_assignments = model.predict(X)
    return model, final_assignments

The idea is to initialize the Categorical Naive Bayes model with random cluster assignments, and then iteratively:

  1. Estimate the membership probabilities of each data point for each cluster (E-step)
  2. Re-fit the model on the most probable cluster for each point (M-step); I take the argmax because CategoricalNB.fit only accepts hard labels, so strictly this is hard (classification) EM rather than fully soft EM
  3. Check for convergence based on the change in log likelihood
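
For completeness, this is how I'm calling it; the data here is just random integers to show the expected input format (integer-encoded categories), so there's no real cluster structure to recover:

import numpy as np

rng = np.random.default_rng(42)
X = rng.integers(0, 4, size=(200, 5))   # 200 points, 5 features, 4 categories each

model, labels = EM_CategoricalNB(X, k=3, verbose=True)
soft_memberships = model.predict_proba(X)   # soft cluster memberships, shape (200, 3)
print(labels[:10])
print(soft_memberships[:3].round(3))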

Upvotes: 0

Views: 50

Answers (0)
