What is the best algorithm to perform unsupervised text classification (clustering) using Python scikit-learn?

I tried CountVectorizer + KMeans but I don't know the number of clusters. Calculating the number of clusters in KMeans took a lot of time when I used the gap statistic method. NMF requires determining the number of components beforehand too.

Upvotes: 2

Answers (1)

mbrg

Reputation: 558

There is no single algorithm that is best for unsupervised text classification. It depends on the data you have, what you are trying to achieve, and so on.

If you wish to avoid the number of clusters issue, you can try DBSCAN, which is a density-based clustering algorithm:

DBSCAN on Wikipedia: a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).

DBSCAN finds the number of clusters automatically by recursively connecting points to a nearby dense group of points (i.e. a cluster).

To use DBSCAN, the most important parameters to tune are epsilon (`eps` in sklearn, the maximum distance at which two points are considered neighbors) and `min_samples` (the number of samples in a neighborhood required for a point to be considered a core point). Try starting with the default parameters sklearn provides (`eps=0.5`, `min_samples=5`), and tune them to get better results for your specific task.
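As a rough illustration of the above, here is a minimal sketch of DBSCAN on text. The toy corpus is made up, and the choices of TfidfVectorizer (instead of the CountVectorizer from the question), cosine distance, `eps=0.5` and `min_samples=2` are my assumptions for this tiny example, not recommended settings for real data.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented purely for illustration.
docs = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "a cat sat on the mat",
    "stock markets fell sharply on monday",
    "stock markets fell sharply on tuesday",
    "the stock markets fell sharply",
    "quantum entanglement experiment",
]

# Turn the documents into sparse TF-IDF vectors.
X = TfidfVectorizer().fit_transform(docs)

# eps and min_samples are guesses that suit this toy data; tune them for yours.
# Cosine distance is an assumption here (sklearn's default metric is Euclidean).
db = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit(X)
labels = db.labels_

# Label -1 marks noise points, so it is excluded from the cluster count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())

print("labels:", labels)
print("clusters found:", n_clusters, "noise points:", n_noise)
```

Note that the number of clusters is read off the result rather than passed in. On this corpus the two themed groups should come out as clusters and the unrelated last document as noise; with real data, expect to try several `eps` values before the grouping looks sensible.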

Upvotes: 2
