Reputation: 1
I am using sklearn's DBSCAN to cluster my text data. I used GoogleNews-vectors-negative300.bin to create a 300-dimensional sentence vector for each document, giving a matrix of size 10000*300. When I pass this matrix to DBSCAN with a few possible values of eps (0.2 to 3) and min_samples (5 to 100), keeping the other parameters at their defaults, I get anywhere from 200 down to 10 clusters. In all of these runs, the noise points are roughly 75-80% of my data. Is there any way to reduce the noise, or some other parameters (distances) I could use to reduce it? I also checked two vectors whose Euclidean distance is 0.6, yet they end up in different clusters; how can I get them into the same cluster?
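For reference, the sentence vectors are created roughly like this (a simplified sketch, assuming gensim is used to load the .bin file and the vectors are simple averages of the word vectors; documents stands for my list of raw texts):

    import numpy as np
    from gensim.models import KeyedVectors

    # Load the pretrained Google News vectors (300 dimensions)
    wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    def sentence_vector(tokens, dim=300):
        # Average the word2vec vectors of the in-vocabulary tokens
        vecs = [wv[t] for t in tokens if t in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    # documents: list of 10000 raw text strings (placeholder name)
    sentence_vectors = np.vstack([sentence_vector(doc.split()) for doc in documents])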
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    # sentence_vectors: the 10000 x 300 matrix of sentence embeddings
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(sentence_vectors)
    ep = 0.3
    min_sam = 10
    for itr in range(1, 11):
        dbscan = DBSCAN(eps=ep, min_samples=min_sam * itr)
        clusters = dbscan.fit_predict(X_scaled)
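This is how the noise percentage is measured after a run (DBSCAN marks noise points with the label -1); here clusters holds the result of the last iteration:

    import numpy as np

    # Fraction of points DBSCAN labelled as noise (-1) in the last run
    noise_fraction = np.mean(clusters == -1)
    print(f"noise: {noise_fraction:.0%}")  # comes out around 75-80% for my data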
Upvotes: 0
Views: 399
Reputation: 77495
If you want two points at distance 0.6 to be in the same cluster, then you may need to use a larger epsilon (which is a distance threshold). With eps of at least 0.6 (and the min_samples condition satisfied), they can end up in the same cluster.
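A tiny illustration of the effect (min_samples is lowered to 2 here only so that two points are enough to form a cluster):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two points that are 0.6 apart (Euclidean distance)
    pts = np.array([[0.0, 0.0], [0.6, 0.0]])

    print(DBSCAN(eps=0.3,  min_samples=2).fit_predict(pts))  # [-1 -1] -> both noise
    print(DBSCAN(eps=0.65, min_samples=2).fit_predict(pts))  # [0 0]   -> same cluster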
Since word2vec is trained with dot products, it would likely make more sense to use the dot product as similarity and/or cosine distance.
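In sklearn that could look like the sketch below; eps=0.2 and min_samples=10 are only placeholder values that still need tuning, since cosine distance lives in [0, 2]:

    from sklearn.cluster import DBSCAN

    # Cluster on cosine distance (1 - cosine similarity) instead of Euclidean.
    # Use the raw sentence vectors here: per-feature standardization shifts the
    # vectors and changes the cosine similarities.
    dbscan = DBSCAN(eps=0.2, min_samples=10, metric='cosine')
    clusters = dbscan.fit_predict(sentence_vectors)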
But in general I doubt you'll be able to get good results. The way sentence vectors are built by averaging word2vec vectors kills too much signal and adds too much noise. And since the data is high-dimensional, all that noise is a problem.
Upvotes: 0