user2129623

Reputation: 2257

Fine-tuning HDBSCAN parameters for clustering text documents

I have text documents which I am clustering using HDBSCAN. When I have a small amount of data, around 35 documents, and the correct number of clusters is around 14, I get the correct result with the following parameters.

import numpy as np
import hdbscan
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import linear_kernel
# FeatureTransform and features2mat are assumed to come from the external
# text-feature library used for these experiments (e.g. nlputils)
from nlputils.features import FeatureTransform, features2mat


def cluster_texts(textdict, eps=0.40, min_samples=1):
    """
    cluster the given texts
    Input:
        textdict: dictionary with {docid: text}
    Returns:
        doccats: dictionary with {docid: cluster_id}
    """
    doc_ids = list(textdict.keys())
    # transform texts into length normalized kpca features
    ft = FeatureTransform(norm='max', weight=True, renorm='length', norm_num=False)
    docfeats = ft.texts2features(textdict)
    X, featurenames = features2mat(docfeats, doc_ids)
    # project onto 12 linear kernel PCA components
    e_lkpca = KernelPCA(n_components=12, kernel='linear')
    X = e_lkpca.fit_transform(X)
    # length-normalize the projected vectors
    xnorm = np.linalg.norm(X, axis=1)
    X = X / xnorm.reshape(X.shape[0], 1)
    # compute cosine distances (1 - cosine similarity)
    D = 1 - linear_kernel(X)
    # and cluster with hdbscan on the precomputed distance matrix
    clst = hdbscan.HDBSCAN(eps=eps, metric='precomputed', min_samples=min_samples,
                           gen_min_span_tree=True, min_cluster_size=2)
    y_pred = clst.fit_predict(D)

    return {did: y_pred[i] for i, did in enumerate(doc_ids)}

Then I replicated the data, duplicating each document 100 times, and tried to fine-tune the clustering. But now I am getting 36 clusters, with each document in a different cluster. I tried changing various parameters, but the clustering result does not change.

Any suggestions or references would be much appreciated.

Upvotes: 0

Views: 1717

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77474

Obviously if you replicate each point 100 times, you need to increase the minPts parameter 100x and the minimum cluster size, too.
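As a rough sketch (assuming the hdbscan library's parameter names, and the 100x replication factor from the question), scaling the density parameters could look like this:

import hdbscan

replication = 100          # each document was duplicated 100 times
base_min_samples = 1       # values that worked on the original 35 documents
base_min_cluster_size = 2

# density-based counts scale with the number of identical copies,
# so minPts (min_samples) and min_cluster_size should scale as well
clst = hdbscan.HDBSCAN(
    metric='precomputed',
    min_samples=base_min_samples * replication,
    min_cluster_size=base_min_cluster_size * replication,
    gen_min_span_tree=True,
)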

But your main problem is likely KernelPCA, which is sensitive to the number of samples you have, and not HDBSCAN.
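One way to check this is to cluster cosine distances computed directly from the raw feature matrix, skipping the KernelPCA projection, and see whether roughly the original 14 clusters come back. A minimal sketch (the helper name is hypothetical, and it assumes X_raw is the matrix returned by features2mat before KernelPCA):

import hdbscan
from sklearn.metrics.pairwise import cosine_distances

def cluster_raw_features(X_raw, min_samples=100, min_cluster_size=200):
    """Cluster cosine distances of the raw features, without KernelPCA.

    min_samples and min_cluster_size are scaled for the 100x replication.
    """
    D_raw = cosine_distances(X_raw)
    clst = hdbscan.HDBSCAN(metric='precomputed',
                           min_samples=min_samples,
                           min_cluster_size=min_cluster_size,
                           gen_min_span_tree=True)
    # if this recovers roughly the original 14 clusters,
    # the KernelPCA step is the problem
    return clst.fit_predict(D_raw)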

Upvotes: 1
