Mridula Gunturi

Reputation: 187

BIC score graph for GMM clustering looks very odd

BIC curve after GMM clustering

I want to use the BIC to find the optimal number of clusters for GMM clustering. I plotted the BIC scores for cluster counts from 2 to 41 (in steps of 3) and got the attached curve. I have no idea how to interpret it. Can someone help?

For reference, this is the code I used to do GMM clustering. It is applied to daily wind vector data over a region, totaling approximately 5,500 columns and 13,880 rows.

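import numpy as np
import pandas
from sklearn.mixture import GaussianMixture


def gmm_clusters(df_std, dates):
    # Candidate numbers of components: 2, 5, 8, ..., 41
    ks = range(2, 44, 3)
    bic_scores = []
    csv_files = []
    for k in ks:
        # A single random initialisation per k (n_init=1, init_params='random')
        model = GaussianMixture(n_components=k,
                                n_init=1,
                                init_params='random',
                                covariance_type='full',
                                verbose=0,
                                random_state=123)
        fitted_model = model.fit(df_std)

        # BIC of the fitted mixture on the training data (lower is better)
        bic_score = fitted_model.bic(df_std)
        bic_scores.append(bic_score)

        # Hard cluster assignment for each row, and the size of each cluster
        labels = fitted_model.predict(df_std)
        print("Labels counts")
        print(np.bincount(labels))

        # Attach the dates and cluster labels to the standardized data
        df_label = pandas.DataFrame(df_std)
        print("############ dataframe AFTER CLUSTERING ###############")
        df_dates = pandas.DataFrame(dates)
        df_dates.columns = ['Date']
        df_dates = df_dates.reset_index(drop=True)
        df_label = df_label.join(df_dates)
        df_label["Cluster"] = labels
        print(df_label)

        # Write one CSV per value of k
        csv_file = "{0}_GMM_2_Countries_850hPa.csv".format(k)
        df_label.to_csv(csv_file)
        csv_files.append(csv_file)

    return ks, bic_scores, csv_files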

Thank you!!

EDIT: Using K-means on the same data, I get an elbow plot (a plot of SSE) that is fairly clear to interpret and indicates that 11 clusters is the optimum.

Upvotes: 0

Views: 1416

Answers (1)

Evgeny Tanhilevich

Reputation: 1194

The first thing that springs to mind is to check the numbers of clusters below 10 with a step of 1, not 3. There may be a dip in the BIC that you are missing there.
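
A minimal sketch of such a fine-grained scan, assuming scikit-learn is available. The data here is a synthetic stand-in (three well-separated 2-D blobs), not the wind dataset, and `bic_scan` is a hypothetical helper name. Unlike the code in the question, it uses the default k-means initialisation and `n_init=3`, which tends to give a less noisy BIC curve than a single random start.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (an assumption, not the wind dataset):
# three well-separated 2-D Gaussian blobs, 200 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(200, 2)) for c in centers])


def bic_scan(X, ks, seed=0):
    """Fit one GMM per k and return the BIC of each fit (lower is better)."""
    scores = []
    for k in ks:
        gmm = GaussianMixture(n_components=k,
                              covariance_type="full",
                              n_init=3,
                              random_state=seed)
        scores.append(gmm.fit(X).bic(X))
    return scores


ks = list(range(1, 11))  # step of 1, not 3
bics = bic_scan(X, ks)
best_k = ks[int(np.argmin(bics))]

for k, b in zip(ks, bics):
    print(f"k={k:2d}  BIC={b:.1f}")
print("lowest BIC at k =", best_k)
```

On the real data you would pass `df_std` instead of `X` and look for a minimum or a clear flattening in the printed scores.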

The second thing is to compare AIC with BIC. See here: https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
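
To make the difference concrete, here is a small sketch of the two criteria using their standard definitions. The function names and the log-likelihood and parameter counts in the example are made up for illustration; only the row count of 13,880 comes from the question.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: each parameter costs 2."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: each parameter costs ln(n_samples)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


n_samples = 13_880  # number of rows, from the question

# Cost of one extra parameter under each criterion
aic_penalty_per_param = 2.0
bic_penalty_per_param = math.log(n_samples)
print("AIC penalty per parameter:", aic_penalty_per_param)
print("BIC penalty per parameter:", round(bic_penalty_per_param, 2))

# Hypothetical numbers: same fit quality, 10 vs 20 parameters
ll = -1000.0
print("AIC:", aic(ll, 10), aic(ll, 20))
print("BIC:", bic(ll, 10, n_samples), bic(ll, 20, n_samples))
```

With this many rows, BIC charges several times more per parameter than AIC, so it favours smaller models. In scikit-learn, a fitted `GaussianMixture` exposes both as `.aic(X)` and `.bic(X)`.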

The third thing is that your dataset has 5,500 dimensions but only 13,880 points, which is fewer than 3 points per dimension. I would be surprised to find any clustering at all (which is what the BIC chart is indicating). You would need to say more about the data: what each column means and what kind of clustering you are looking for.
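
As a rough illustration of why so few points per dimension is a problem, the sketch below counts the free parameters of a Gaussian mixture with `covariance_type='full'` (means, one symmetric covariance matrix per component, and mixing weights) and compares that with the number of values in the data. The helper name `gmm_full_n_params` is made up for this example, and the column count is the approximate figure from the question.

```python
def gmm_full_n_params(n_components, n_features):
    """Free parameters of a GMM with one full covariance per component."""
    means = n_components * n_features
    # A symmetric d x d covariance matrix has d * (d + 1) / 2 free entries
    covs = n_components * n_features * (n_features + 1) // 2
    # Mixing weights sum to 1, so one of them is determined by the rest
    weights = n_components - 1
    return means + covs + weights


n_samples, n_features = 13_880, 5_500  # approximate shape from the question
n_values = n_samples * n_features      # total scalar values in the data

for k in (2, 5, 8, 11):
    p = gmm_full_n_params(k, n_features)
    print(f"k={k:2d}  parameters={p:,}  parameters/data values={p / n_values:.2f}")
```

For k = 2 this already gives 30,266,501 parameters, and by k = 8 the model has more free parameters than the dataset has values. The BIC penalty grows with that count, so restricting the covariance (`'diag'` or `'tied'`) or reducing the dimensionality first, for example with PCA, may be worth trying before reading much into the curve.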

Upvotes: 1
