shiva

Reputation:

How do I calculate the cosine similarity of two vectors?

How do I find the cosine similarity between vectors?

I need to find the similarity to measure the relatedness between two lines of text.

For example, I have two sentences like:

system for user interface

user interface machine

… and their respective vectors after TF-IDF, followed by normalisation using LSI, for example [1,0.5] and [0.5,1].

How do I measure the similarity between these vectors?

Upvotes: 38

Views: 76715

Answers (7)

裴帅帅

Reputation: 11

// Cosine similarity of two dense vectors; assumes both have the same length.
def cosineSimilarity(vectorA: Vector[Double], vectorB: Vector[Double]): Double = {
    var dotProduct = 0.0
    var normA = 0.0
    var normB = 0.0

    // Accumulate the dot product and the squared norms in a single pass.
    for (i <- vectorA.indices) {
        dotProduct += vectorA(i) * vectorB(i)
        normA += Math.pow(vectorA(i), 2)
        normB += Math.pow(vectorB(i), 2)
    }

    dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}

def main(args: Array[String]): Unit = {
    val vectorA = Array(1.0,2.0,3.0).toVector
    val vectorB = Array(4.0,5.0,6.0).toVector
    println(cosineSimilarity(vectorA, vectorA))
    println(cosineSimilarity(vectorA, vectorB))
}

Scala version

Upvotes: 1

TG Gowda

Reputation: 11957

For a sparse representation of vectors using Map(dimension -> magnitude), here is a Scala version (you can do something similar in Java 8):

def cosineSim(vec1: Map[Int, Int],
              vec2: Map[Int, Int]): Double = {
  // Only dimensions present in both vectors contribute to the dot product.
  // toList keeps equal products from collapsing, as they would in a Set.
  val dotProduct: Double = vec1.keySet.intersect(vec2.keySet).toList
    .map(dim => vec1(dim) * vec2(dim)).sum
  val norm1: Double = vec1.values.map(mag => mag * mag).sum
  val norm2: Double = vec2.values.map(mag => mag * mag).sum
  dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2))
}
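
A rough Java 8 counterpart of the sparse version, as a sketch only. The class name `SparseCosine`, the dimension numbering, and the use of raw term counts (rather than TF-IDF weights) for the two sentences from the question are all assumptions made for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
class SparseCosine {

    // Cosine similarity of two sparse vectors stored as dimension -> magnitude.
    static double cosineSim(Map<Integer, Integer> vec1, Map<Integer, Integer> vec2) {
        // Only dimensions present in both maps contribute to the dot product.
        double dotProduct = vec1.entrySet().stream()
                .filter(e -> vec2.containsKey(e.getKey()))
                .mapToDouble(e -> (double) e.getValue() * vec2.get(e.getKey()))
                .sum();
        double norm1 = vec1.values().stream().mapToDouble(m -> (double) m * m).sum();
        double norm2 = vec2.values().stream().mapToDouble(m -> (double) m * m).sum();
        // Yields NaN if either vector is empty or all zeros.
        return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public static void main(String[] args) {
        // Assumed dimensions: 0=system, 1=for, 2=user, 3=interface, 4=machine.
        Map<Integer, Integer> first = new HashMap<>();   // "system for user interface"
        first.put(0, 1); first.put(1, 1); first.put(2, 1); first.put(3, 1);
        Map<Integer, Integer> second = new HashMap<>();  // "user interface machine"
        second.put(2, 1); second.put(3, 1); second.put(4, 1);

        // Two shared terms out of norms sqrt(4) and sqrt(3): 2 / (2 * sqrt(3)).
        System.out.println(cosineSim(first, second));
    }
}
```

Dimensions missing from either map are treated as zero, so they add nothing to the dot product but still count toward the norm of the vector that has them.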

Upvotes: 1

Toon Krijthe

Reputation: 53416

Have a look at: http://en.wikipedia.org/wiki/Cosine_similarity.

If you have vectors A and B, the similarity is defined as:

cosine(theta) = (A . B) / (||A|| ||B||)

For a vector A = (a1, a2), ||A|| is defined as sqrt(a1^2 + a2^2).

For vectors A = (a1, a2) and B = (b1, b2), A . B is defined as a1 b1 + a2 b2.

So for vectors A = (a1, a2) and B = (b1, b2), the cosine similarity is given as:

  (a1 b1 + a2 b2) / (sqrt(a1^2 + a2^2) sqrt(b1^2 + b2^2))

Example:

A = (1, 0.5), B = (0.5, 1)

cosine(theta) = (0.5 + 0.5) / (sqrt(5/4) sqrt(5/4)) = 1 / (5/4) = 4/5
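
The worked example can be checked numerically. Here is a small Java sketch that computes each piece of the formula for A = (1, 0.5) and B = (0.5, 1); the class name `WorkedExample` and its method names are made up for illustration:

```java
// Hypothetical class; reproduces the worked example step by step.
class WorkedExample {

    // Dot product A . B of two equal-length vectors.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Euclidean norm ||A|| = sqrt(A . A).
    static double norm(double[] a) {
        return Math.sqrt(dot(a, a));
    }

    // cosine(theta) = (A . B) / (||A|| ||B||)
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (norm(a) * norm(b));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 0.5};
        double[] b = {0.5, 1.0};
        System.out.println("A . B = " + dot(a, b));              // numerator: 0.5 + 0.5
        System.out.println("||A|| = " + norm(a));                // sqrt(5/4)
        System.out.println("||B|| = " + norm(b));                // sqrt(5/4)
        System.out.println("cosine(theta) = " + cosine(a, b));   // 4/5, up to floating-point rounding
    }
}
```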

Upvotes: 33

Anonymous

Reputation: 18631

When I was working on text mining some time ago, I used the SimMetrics library, which provides an extensive range of different metrics in Java. If you happen to need more, there is always R and CRAN to look at.

But coding it from the description on Wikipedia is a rather trivial task, and can be a nice exercise.

Upvotes: 2

Alphaaa

Reputation: 4316

If you want to avoid relying on third-party libraries for such a simple task, here is a plain Java implementation:

public static double cosineSimilarity(double[] vectorA, double[] vectorB) {
    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < vectorA.length; i++) {
        dotProduct += vectorA[i] * vectorB[i];
        normA += Math.pow(vectorA[i], 2);
        normB += Math.pow(vectorB[i], 2);
    }   
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

Note that the function assumes that the two vectors have the same length. You may want to explicitly check this for safety.
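
One way to make that check explicit, as a sketch only: the class name `SafeCosine` and the choice to throw `IllegalArgumentException` are assumptions, not part of the implementation above. It also rejects zero vectors, for which the formula would divide zero by zero and return NaN:

```java
// Hypothetical variant with input validation added.
class SafeCosine {

    static double cosineSimilarity(double[] vectorA, double[] vectorB) {
        // Fail fast instead of reading past the end of the shorter array.
        if (vectorA.length != vectorB.length) {
            throw new IllegalArgumentException(
                    "Vectors must have the same length: "
                            + vectorA.length + " vs " + vectorB.length);
        }
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            dotProduct += vectorA[i] * vectorB[i];
            normA += vectorA[i] * vectorA[i];
            normB += vectorB[i] * vectorB[i];
        }
        // A zero (or empty) vector has no direction, so the angle is undefined.
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException(
                    "Cosine similarity is undefined for a zero vector");
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Whether to throw, return 0, or return NaN for a zero vector depends on the application; throwing is just the choice made in this sketch.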

Upvotes: 72

Nick Fortescue

Reputation: 44183

For matrix code in Java I'd recommend using the Colt library. If you have this, the code looks like (not tested or even compiled):

import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.impl.DenseDoubleMatrix1D;

DoubleMatrix1D a = new DenseDoubleMatrix1D(new double[]{1, 0.5});
DoubleMatrix1D b = new DenseDoubleMatrix1D(new double[]{0.5, 1});
double cosineSimilarity = a.zDotProduct(b) / Math.sqrt(a.zDotProduct(a) * b.zDotProduct(b));

The code above could also be altered to use one of the Blas.dnrm2() methods or Algebra.DEFAULT.norm2() for the norm calculation. The result is exactly the same; which is more readable depends on taste.

Upvotes: 5

Mark Davidson

Reputation: 5513

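public class CosineSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
    // Element-wise product, then the 1-norm (maximum absolute column sum).
    // This equals the dot product only when the documents are column
    // vectors with non-negative entries, as tf-idf weights are.
    double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();
    // Product of the two Frobenius (Euclidean) norms.
    double normProduct = sourceDoc.normF() * targetDoc.normF();
    return dotProduct / normProduct;
  }
}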

I did some TF-IDF work recently for my Information Retrieval unit at university. I used this cosine similarity method, which uses Jama, the Java Matrix Package.

For the full source code, see IR Math with Java: Similarity Measures, a really good resource that covers a good few different similarity measures.

Upvotes: 22
