Reu

Reputation: 1287

Fast vector difference/similarity measures


I'm working on a project where I am using genetic algorithms to generate word lists which best describe a text.
I'm presently using cosine similarity to do it, but it has two flaws: it's far too slow for the purpose, and if the two vectors being compared are zeroes, it ends up with an artificially high similarity and a word vector that isn't very good. Any suggestions for other measures that would be faster or take less notice of words that aren't there? Thanks.
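To illustrate the second flaw, here's a minimal sketch of the kind of computation involved (not my actual code; plain NumPy, with the all-zero case handled the way some implementations do):

    import numpy as np

    def cosine(a, b):
        # Dot product over product of magnitudes.
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom == 0.0:
            # A zero vector has no direction; conventions that return 1.0
            # here are what produce the artificially high similarity.
            return 1.0
        return np.dot(a, b) / denom

    print(cosine(np.zeros(5), np.zeros(5)))  # 1.0 -- a spurious "perfect" match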

Upvotes: 1

Views: 1137

Answers (1)

Jeffrey Hantin

Reputation: 36524

Cosine similarity is dot-product over product-of-magnitudes, so minimizing the number of dimensions is crucial.
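For example, if you keep the vectors as sparse {term: count} dicts, only the nonzero dimensions cost anything, so every dimension you cull pays off directly (a rough sketch, not tuned):

    import math

    def sparse_cosine(a, b):
        # Cosine over {term: count} dicts; absent (zero) dimensions cost nothing.
        if len(a) > len(b):
            a, b = b, a  # iterate the smaller dict for the dot product
        dot = sum(v * b.get(k, 0) for k, v in a.items())
        mag = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        # Return 0.0 for zero vectors rather than a spuriously high score.
        return dot / mag if mag else 0.0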

To cull the herd a bit, you might want to apply stemming to collapse words with similar meaning into a single dimension, and toss out hapax legomena (words that only occur once in the corpus under consideration) from the dimension pool, since an algorithm isn't likely to be able to derive much useful information from them.
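A sketch of that culling step, assuming NLTK's Porter stemmer is available (any stemmer would do; the n > 1 cutoff is exactly the hapax filter):

    from collections import Counter
    from nltk.stem import PorterStemmer

    def build_dimensions(documents):
        # Stem every token, then drop hapax legomena (corpus frequency 1).
        stemmer = PorterStemmer()
        counts = Counter(stemmer.stem(tok)
                         for doc in documents
                         for tok in doc.lower().split())
        return {stem for stem, n in counts.items() if n > 1}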

I'm not sure what would give rise to the zero vectors, though. Can you give an example?

EDIT: So what you're after is to create a word list that is selective for a particular document or cluster? In that case, you need some way to eliminate low-selectivity words.

You might want to treat the most common words as stop words to further cull your dimension set and get back a little bit more performance. Also, on the genetic algorithm side, your fitness function needs to penalize word lists that match documents outside of the target cluster, not just reward those that match documents within the cluster, so your word list doesn't get cluttered with terms that are merely frequent rather than selective.
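A sketch of that kind of fitness function (overlap here is just a stand-in for whatever similarity measure you settle on):

    def overlap(word_list, doc_tokens):
        # Trivial stand-in similarity: fraction of the word list in the doc.
        return len(set(word_list) & set(doc_tokens)) / len(word_list)

    def fitness(word_list, in_cluster, out_cluster, match=overlap, penalty=1.0):
        # Reward in-cluster matches and penalize out-of-cluster matches,
        # so merely-frequent (non-selective) words drag the score down.
        reward = sum(match(word_list, d) for d in in_cluster) / len(in_cluster)
        noise = sum(match(word_list, d) for d in out_cluster) / len(out_cluster)
        return reward - penalty * noise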

If you need better semantic selectivity even after tuning the fitness function, you might want to consider using orthogonal sparse bigrams instead of individual words. I've no idea what that does to the number of dimensions, though: while there will be O(kn²) distinct terms instead of n, a lot more of them will be hapaxes. That may also be a problem if you need individual words rather than OSBs in your term lists.
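For concreteness, OSB generation looks roughly like this (the window size of 4 is an arbitrary choice here):

    def osb_terms(tokens, window=4):
        # Pair each token with each of the next window-1 tokens, keeping
        # the skip distance in the term so "not good" and "not ... good"
        # remain distinct features.
        terms = []
        for i, w in enumerate(tokens):
            for d in range(1, window):
                if i + d < len(tokens):
                    terms.append(f"{w}<{d}>{tokens[i + d]}")
        return terms

    print(osb_terms("this is not very good".split()))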

Upvotes: 3
