Reputation: 35
I am trying to cluster word vectors using ELKI DBSCAN. I want to use cosine distance to cluster 300-dimensional word vectors. The dataset contains 19,000 words (a 19000*300 matrix). The vectors were computed with gensim word2vec, and the resulting list was saved as a CSV.
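Roughly, the export looked like this (a sketch, not the exact code; gensim 4.x attribute names and the model path are assumptions):

    import numpy as np
    from gensim.models import Word2Vec

    # load a previously trained model (path is just an example)
    model = Word2Vec.load("word2vec.model")

    # model.wv.vectors is the (vocab_size, 300) matrix, one row per word,
    # rows ordered by model.wv.index_to_key
    np.savetxt(r"D:\w2v\vectors.csv", model.wv.vectors, delimiter=",")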
Below is the command I passed in the UI:
KDDCLIApplication -dbc.in "D:\w2v\vectors.csv" -parser.colsep '","' -algorithm clustering.DBSCAN -algorithm.distancefunction CosineDistanceFunction -dbscan.epsilon 1.02 -dbscan.minpts 5 -vis.window.single
I played around with the epsilon value and tried three values: 0.8, 0.9, and 1.0. For 0.8 and 0.9 I got "There are very few neighbors found. Epsilon may be too small.", while for 1.0 I got "There are very many neighbors found. Epsilon may be too large."
What am I doing wrong here? I am quite new to ELKI, so any help is appreciated.
Upvotes: 1
Views: 590
Reputation: 8715
At 300 dimensions, you will be seeing the curse of dimensionality.
Contrary to popular claims, the curse of dimensionality does exist for cosine (since cosine is equivalent to Euclidean on normalized vectors, it can be at best one dimension "better" than Euclidean). What often makes cosine applications still work on text is that the intrinsic dimensionality is much lower than the representation dimensionality (i.e., while your vocabulary may have thousands of words, only a few occur in the intersection of two documents).
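Here is a minimal sketch (numpy only) of that equivalence: for L2-normalized vectors, the squared Euclidean distance equals twice the cosine distance, so cosine buys you essentially nothing against the curse of dimensionality:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=300), rng.normal(size=300)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # L2-normalize

    cos_dist = 1.0 - a @ b                 # cosine distance = 1 - cosine similarity
    sq_euclid = np.sum((a - b) ** 2)       # squared Euclidean distance

    print(np.isclose(sq_euclid, 2.0 * cos_dist))   # True: ||a-b||^2 = 2 * (1 - cos)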
Word vectors are usually not sparse, so your intrinsic dimension may be quite high, and you will see the curse of dimensionality.
So it is not surprising to see the cosine distances concentrate, and you may then need to choose a threshold with a few digits of precision.
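You can check the concentration yourself by looking at the distribution of cosine distances between random pairs (a sketch, assuming the CSV is plain numeric; the path is the one from your question):

    import numpy as np

    vectors = np.loadtxt(r"D:\w2v\vectors.csv", delimiter=",")
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    rng = np.random.default_rng(0)
    i = rng.integers(0, len(vectors), 10000)
    j = rng.integers(0, len(vectors), 10000)
    d = 1.0 - np.einsum("ij,ij->i", vectors[i], vectors[j])   # cosine distances of random pairs

    # a narrow spread between low and high percentiles means the distances concentrate
    print(np.percentile(d, [1, 25, 50, 75, 99]))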
For obvious reasons, 1.0 is a nonsense threshold for cosine distance: the maximum cosine distance is 1.0, so with epsilon = 1.0 essentially everything becomes a neighbor of everything, which matches the warning you saw. So yes, you will need to try values such as 0.95 and 0.99.
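For example, the same command as above with only the epsilon changed (everything else kept as in the question):

    KDDCLIApplication -dbc.in "D:\w2v\vectors.csv" -parser.colsep '","' -algorithm clustering.DBSCAN -algorithm.distancefunction CosineDistanceFunction -dbscan.epsilon 0.95 -dbscan.minpts 5 -vis.window.single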
You can use the KNNDistancesSampler to help you choose DBSCAN parameters, or you can use, for example, OPTICS (which allows you to find clusters at different thresholds, rather than one single threshold).
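Outside of ELKI, the k-distance idea behind KNNDistancesSampler can also be sketched with scikit-learn (an illustration, not the ELKI implementation): sort the distance to the k-th nearest neighbor of every point and look for a knee in that curve; its value is a candidate epsilon for minpts = k.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    vectors = np.loadtxt(r"D:\w2v\vectors.csv", delimiter=",")

    k = 5   # match dbscan.minpts
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(vectors)
    dist, _ = nn.kneighbors(vectors)       # column 0 is the point itself (distance 0)
    kdist = np.sort(dist[:, k])            # k-NN cosine distance of every point, sorted

    # plot kdist and look for a knee; a few percentiles already give a feel for it
    print(np.percentile(kdist, [50, 90, 95, 99]))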
Beware that word vectors are trained for a very specific scenario: substitutability. They are by far not as universal as popularly interpreted based on the "king-man+woman=queen" example. Just try "king-man+boy", which often also returns "queen" (or "kings"); the result arises mostly because the nearest neighbors of "king" are "queen" and "kings".

The "capital" example is similarly overfitted to the training data. The model is trained on news articles, which often begin the text with "capital, country, blah blah". If you omit "capital", and if you omit "country", you get almost exactly the same context, so the word2vec model learns that the two are "substitutable". This works as long as the capital is also where the major newspapers are based (e.g., Berlin, Paris). It often fails for countries like Canada, the U.S., or Australia, where the main reporting hubs are located elsewhere, e.g., in Toronto, New York, or Sydney. It does not really prove that the vectors have learned what a capital is; the reason it worked in the first place is overfitting to the news training data.
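If you still have the gensim model at hand, you can check this yourself (a sketch; the model path is a placeholder):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load("word2vec.wordvectors")   # or load_word2vec_format(...)

    # the celebrated analogy ...
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
    # ... and the nonsensical one, which often returns "queen"/"kings" as well
    print(wv.most_similar(positive=["king", "boy"], negative=["man"], topn=5))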
Upvotes: 1