valentine

Reputation: 57

How do I measure the goodness of cosine similarity scores across different vector spaces?

I am a computer scientist working on a problem that requires some statistical measures, though (not being very well versed in statistics) I am not quite sure what statistics to use.

Overview:

I have a set of questions (from StackExchange sites, of course) and with this data, I am exploring algorithms that will find similar questions to one I provide. Yes, StackExchange sites already perform this function, as do many other Q&A sites. What I am trying to do is analyze the methods and algorithms that people employ to accomplish this task to see which methods perform best. My problem is finding appropriate statistical measures to quantitatively determine "which methods perform best."

The Data:

I have a set of StackExchange questions, each of which is saved like this: {'questionID':"...", 'questionText':"..."}. For each question, I have a set of other questions linked to it or from it. It is common practice for people answering questions on StackExchange sites to add links to other similar posts in their answers, e.g. "Have you read this post [insert link to post here] by so-and-so? They're solving a similar problem..." I am considering these linked questions to be 'similar' to one another.

More concretely, let's say we have question A. Question A has a collection of linked questions {B, C, D}. So A_linked = {B, C, D}. My intuition tells me that the transitive property does not apply here. That is, just because A is similar to B, and A is similar to C, I cannot confirm that B is similar to C. (Or can I?) However, I can confidently say that if A is similar to B, then B is similar to A.

So, to simplify these relationships, I will create a set of similar pairs: {A, B}, {A, C}, {A, D}. These pairs will serve as a ground truth of sorts. These are questions we know are similar to one another, so their similarity confidence equals 1, i.e. similarityConfidence({A,B}) = 1.
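
A minimal sketch of that bookkeeping, assuming a hypothetical links mapping from each questionID to the IDs linked to or from it (storing each pair as a frozenset makes it unordered, so {A, B} and {B, A} are the same pair):

# Hypothetical example input: 'links' maps each questionID to the IDs linked in its answers.
links = {'A': ['B', 'C', 'D'], 'B': ['A']}

# Store each pair as a frozenset so that {A, B} and {B, A} count as the same pair.
similar_pairs = set()
for question_id, linked_ids in links.items():
    for linked_id in linked_ids:
        similar_pairs.add(frozenset((question_id, linked_id)))

def similarityConfidence(pair):
    # Linked pairs are known to be similar (confidence 1); anything else is unknown, not 0.
    return 1.0 if frozenset(pair) in similar_pairs else None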

Something to note about this set-up is that we know only a few similar questions for each question in our dataset. What we don't know is whether some other question E is also similar to A. It might be similar, it might not be similar, we don't know. So our 'ground truth' is really only some of the truth.

The algorithm:

A simplified pseudocode version of the algorithm is this:

for q in questions:  # remember q = {'questionID':"...", 'questionText':"..."}
    similarities = {}  # will hold a mapping from questionID to similarity to q
    q_Vector = vectorize(q)  # create a vector from the question text (each word is a dimension, value is unimportant)
    for o in questions:
        if o['questionID'] == q['questionID']:
            continue  # don't compare a question to itself
        o_Vector = vectorize(o)
        # values will be in the range of 1.0 = identical to 0.0 = not similar at all
        similarities[o['questionID']] = cosineSimilarity(q_Vector, o_Vector)
    #now what???
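
To make the snippet above self-contained, here is a minimal sketch of one possible vectorize()/cosineSimilarity() pair. The vectorizer shown is a plain bag-of-words over lowercased tokens, chosen only for illustration; it is just one of the many variations you could plug in, not the one being evaluated:

import math
from collections import Counter

def vectorize(question):
    # One possible variation: plain bag-of-words over lowercased whitespace tokens.
    return Counter(question['questionText'].lower().split())

def cosineSimilarity(v1, v2):
    # Dot product over shared dimensions, divided by the product of the vector norms.
    dot = sum(count * v2[term] for term, count in v1.items() if term in v2)
    norm1 = math.sqrt(sum(count * count for count in v1.values()))
    norm2 = math.sqrt(sum(count * count for count in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0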

So now I have a complete mapping of cosine similarity scores between q and every other question in my dataset. My ultimate goal is to run this code for many variations of the vectorize() function (each of which will return a slightly different vector) and determine which variation performs best in terms of cosine scores.

The Problem:

So here lies my question. Now what? How do I quantitatively measure how good these cosine scores are?

I have brainstormed a few measurement ideas, though they feel unrefined and incomplete.

Do you have any other ideas? How can I quantitatively measure which vectorize() function performs best?

Upvotes: 1

Views: 5374

Answers (1)

amit

Reputation: 178521

First, note that this question is highly relevant to the Information Retrieval problems of similarity and near-duplicate detection.

As far as I can see, your problem can be split into two problems:

  1. Determining ground truth: In many 'competitions' where the ground truth is unclear, the relevant documents are determined by taking the documents that were returned by X% of the candidates.
  2. Choosing the best candidate: first note that directly comparing the scores of two different algorithms is usually meaningless. The scales could be completely different, so the comparison is pointless. To compare two algorithms, you should use the ranking each algorithm produces - how each algorithm ranks the documents, and how far that ranking is from the ground truth.
    A naive way to do this is with precision and recall, which you can combine into the F-measure. The problem is that a document ranked 10th then counts exactly as much as a document ranked 1st.
    A better way is NDCG - this is the most common way to compare algorithms in the articles I have encountered, and it is widely used at the main IR conferences (WWW, SIGIR). NDCG assigns a score to a ranking, giving more weight to documents ranked near the top and less weight to documents ranked lower. Another common variation is NDCG@k, where NDCG is computed only over the top k documents for each query. (A sketch of both measures follows below.)
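
For illustration, a minimal sketch of both measures for a single query, assuming binary relevance (a question is relevant iff it is in the query's linked set) and the similarities dict from the question:

import math

def precision_recall_f1(ranked_ids, relevant_ids, k):
    # Binary relevance: a question counts as relevant iff it is in the linked ('ground truth') set.
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def ndcg_at_k(ranked_ids, relevant_ids, k):
    # DCG: each relevant document contributes 1 / log2(rank + 1), with ranks counted from 1.
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant_ids)
    # Ideal DCG: all relevant documents packed at the top of the ranking.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg else 0.0

# Usage: rank the other questions by cosine score, then evaluate against the linked set.
# ranked_ids = sorted(similarities, key=similarities.get, reverse=True)
# print(ndcg_at_k(ranked_ids, {'B', 'C', 'D'}, k=10))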

Hope this background and advice helps.

Upvotes: 1
