Reputation: 1
I am using sklearn's DBSCAN to cluster my text data. I used GoogleNews-vectors-negative300.bin to create a 300-dimensional sentence vector for each document, giving a matrix of size 10000*300. When I pass this matrix to DBSCAN with a few possible values of eps (0.2 to 3) and min_samples (5 to 100), keeping the other parameters at their defaults, I get anywhere from 200 down to 10 clusters. In all of these runs, the noise points are roughly 75-80% of my data. Is there any way to reduce the noise, or some other parameters (distances) I could use to reduce it? I also checked two vectors whose Euclidean distance is 0.6, yet they end up in different clusters; how can I get them into the same cluster?
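For reference, the sentence vectors are created roughly like this (a simplified sketch, assuming gensim is used to load the .bin file and the vectors are simple averages of the word vectors; documents stands for my list of raw texts):

    import numpy as np
    from gensim.models import KeyedVectors

    # Load the pretrained Google News vectors (300 dimensions)
    wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    def sentence_vector(tokens, dim=300):
        # Average the word2vec vectors of the in-vocabulary tokens
        vecs = [wv[t] for t in tokens if t in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    # documents: list of 10000 raw text strings (placeholder name)
    sentence_vectors = np.vstack([sentence_vector(doc.split()) for doc in documents])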
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    # sentence_vectors: the 10000 x 300 matrix of sentence embeddings
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(sentence_vectors)
    ep = 0.3
    min_sam = 10
    for itr in range(1, 11):
        dbscan = DBSCAN(eps=ep, min_samples=min_sam * itr)
        clusters = dbscan.fit_predict(X_scaled)
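This is how the noise percentage is measured after a run (DBSCAN marks noise points with the label -1); here clusters holds the result of the last iteration:

    import numpy as np

    # Fraction of points DBSCAN labelled as noise (-1) in the last run
    noise_fraction = np.mean(clusters == -1)
    print(f"noise: {noise_fraction:.0%}")  # comes out around 75-80% for my data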
Upvotes: 0
Views: 399
Reputation: 77495
If you want two points at distance 0.6 to be in the same cluster, then you may need to use a larger epsilon (which is a distance threshold). With eps of at least 0.6 (and the min_samples condition satisfied), they can end up in the same cluster.
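A tiny illustration of the effect (min_samples is lowered to 2 here only so that two points are enough to form a cluster):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two points that are 0.6 apart (Euclidean distance)
    pts = np.array([[0.0, 0.0], [0.6, 0.0]])

    print(DBSCAN(eps=0.3,  min_samples=2).fit_predict(pts))  # [-1 -1] -> both noise
    print(DBSCAN(eps=0.65, min_samples=2).fit_predict(pts))  # [0 0]   -> same cluster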
Since word2vec is trained with dot products, it would likely make more sense to use the dot product as similarity and/or cosine distance.
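In sklearn that could look like the sketch below; eps=0.2 and min_samples=10 are only placeholder values that still need tuning, since cosine distance lives in [0, 2]:

    from sklearn.cluster import DBSCAN

    # Cluster on cosine distance (1 - cosine similarity) instead of Euclidean.
    # Use the raw sentence vectors here: per-feature standardization shifts the
    # vectors and changes the cosine similarities.
    dbscan = DBSCAN(eps=0.2, min_samples=10, metric='cosine')
    clusters = dbscan.fit_predict(sentence_vectors)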
But in general I doubt you'll be able to get good results. The way sentence vectors are built by averaging word2vec vectors kills too much signal and adds too much noise. And since the data is high-dimensional, all that noise is a problem.
Upvotes: 0