Reputation:
I want to use cosine similarity in my IR project, but because the vectors are large and each comparison requires many float multiplications, it takes a long time.
Is there any way to calculate cosine similarity faster?
Here is my code:
private double diffrence(HashMap<Integer, Float> hashMap,
                         HashMap<Integer, Float> hashMap2) {
    Integer[] keys = new Integer[hashMap.size()];
    hashMap.keySet().toArray(keys);
    float ans = 0;
    for (int i = 0; i < keys.length; i++) {
        if (hashMap2.containsKey(keys[i])) {
            ans += hashMap.get(keys[i]) * hashMap2.get(keys[i]);
        }
    }
    float hashLength = 0;
    for (int i = 0; i < keys.length; i++) {
        hashLength += (hashMap.get(keys[i]) * hashMap.get(keys[i]));
    }
    hashLength = (float) Math.sqrt(hashLength);

    Integer[] keys2 = new Integer[hashMap2.size()];
    hashMap2.keySet().toArray(keys2);
    float hash2Length = 0;
    for (int i = 0; i < keys2.length; i++) {
        hash2Length += hashMap2.get(keys2[i]) * hashMap2.get(keys2[i]);
    }
    hash2Length = (float) Math.sqrt(hash2Length);

    return (float) (ans / (hash2Length * hashLength));
}
Upvotes: 1
Views: 7946
Reputation: 211
You could take a look at the project simbase (https://github.com/guokr/simbase), a vector-similarity NoSQL database.
Simbase works as follows: write operations are handled in a single thread per basis, and each written vector has to be compared against every existing vector, so a write scales as O(n).
In a preliminary performance test with dense vectors on an i7 MacBook, it easily handled 100k 1k-dimensional vectors with each write operation taking under 0.14 s; if the linear scaling holds, Simbase should handle 700k dense vectors with each write operation taking under 1 s.
Upvotes: 0
Reputation: 18675
In addition to pre-normalizing your vectors as others have already suggested, and assuming your list of vectors does not change, transform each one into a pair of arrays once (outside the similarity function) and sort them by key index, e.g.:
Integer[] keys = new Integer[hashMap.size()];
Float[] values = new Float[keys.length];
int i = 0;
float norm = ...;   // the vector's norm, computed beforehand
for (Map.Entry<Integer, Float> entry : new TreeMap<Integer, Float>(hashMap).entrySet())
{
    keys[i] = entry.getKey();
    values[i++] = entry.getValue() / norm;
}
Then, to do the actual similarity calculation (assuming you pass keys1, values1, keys2, values2 instead of two HashMaps), your innermost loop reduces to:
float ans = 0;
int i = 0, j = 0;
while (i < keys1.length && j < keys2.length)
{
    if (keys1[i] < keys2[j])
        ++i;
    else if (keys1[i] > keys2[j])
        ++j;
    else
    {
        // the same key occurs in both vectors
        ans += values1[i] * values2[j];
        ++i;
        ++j;
    }
}
You could even consider storing the keys and values of all vectors consecutively in one large int array and one large float array, keeping another array with the index of each vector's first position:
int sumOfAllVectorLengths = ...;
int[] allKeys = new int[sumOfAllVectorLengths];
float[] allValues = new float[sumOfAllVectorLengths];
int[] firstPos = new int[numberOfVectors + 1];
firstPos[numberOfVectors] = sumOfAllVectorLengths;
int nextFirstPos = 0;
int index = 0;
for (HashMap<Integer, Float> vector : allVectors)
{
    firstPos[index] = nextFirstPos;
    float norm = ...;   // the norm of this vector
    for (Map.Entry<Integer, Float> entry : new TreeMap<Integer, Float>(vector).entrySet())
    {
        allKeys[nextFirstPos] = entry.getKey();
        allValues[nextFirstPos++] = entry.getValue() / norm;
    }
    ++index;
}
and then just pass the arrays and the indices of the vectors to the comparison function.
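For example, the comparison function could then look roughly like this (a sketch only; the parameter names follow the arrays above, and a and b are the indices of the two vectors being compared):

    // Sketch: dot product of two pre-normalized vectors stored in the flattened arrays.
    static float dotProduct(int[] allKeys, float[] allValues, int[] firstPos, int a, int b)
    {
        int i = firstPos[a], iEnd = firstPos[a + 1];
        int j = firstPos[b], jEnd = firstPos[b + 1];
        float ans = 0;
        while (i < iEnd && j < jEnd)
        {
            if (allKeys[i] < allKeys[j])
                ++i;
            else if (allKeys[i] > allKeys[j])
                ++j;
            else
                ans += allValues[i++] * allValues[j++];
        }
        return ans; // already the cosine similarity, since the values were divided by the norms
    }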
Upvotes: 1
Reputation: 363567
Typically in IR, one vector has far fewer non-zero elements than the other (and usually the query vector is the sparser one, but this is true even for document vectors). You can save time by looping over the keys of the sparser vector, i.e. the smaller hash map, looking them up in the larger one.
As for pkacprzak's suggestion of a lookup table and your lack of memory: realize that normalization can be done prior to the cosine similarity computations. For each vector, before storing it, compute its norm and divide every element by that. Then, you can just compute a dot product and get a cosine similarity out.
I.e., cosine similarity is usually defined as

    x·y / (||x|| × ||y||)

but that's equal to

    (x / ||x||) · (y / ||y||)

where / is element-wise division. If you replace each x by x / ||x||, then you only need to compute x·y.
If you combine these two bits of advice, you get a cosine similarity algorithm that takes just one loop over the smaller of the two inputs.
Further improvements can be made by using smarter sparse vector structures; hash tables have a lot of overhead both in lookup and iteration.
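Putting those two pieces of advice together, a minimal sketch might look like this (assuming both maps were already divided by their norms when they were stored, and that java.util.Map and java.util.HashMap are imported as in the question's code; the method name similarity is just illustrative):

    // Dot product of two pre-normalized sparse vectors = their cosine similarity.
    // Iterate over the smaller map and look its keys up in the larger one.
    private double similarity(HashMap<Integer, Float> a, HashMap<Integer, Float> b) {
        if (a.size() > b.size()) {           // make a the smaller map
            HashMap<Integer, Float> tmp = a;
            a = b;
            b = tmp;
        }
        double ans = 0;
        for (Map.Entry<Integer, Float> e : a.entrySet()) {
            Float other = b.get(e.getKey()); // one lookup instead of containsKey + get
            if (other != null)
                ans += e.getValue() * other;
        }
        return ans;
    }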
Upvotes: 9
Reputation: 4511
I can clearly see at least one place where you are simply wasting CPU cycles:
for (int i = 0; i < keys.length; i++) {
    if (hashMap2.containsKey(keys[i])) {
        ans += hashMap.get(keys[i]) * hashMap2.get(keys[i]);
    }
}
float hashLength = 0;
for (int i = 0; i < keys.length; i++) {
    hashLength += (hashMap.get(keys[i]) * hashMap.get(keys[i]));
}
Here you have two loops with the same bounds over the same hash maps. Why not just do it in one loop:
float hashLength = 0;
float hm = 0;
for (int i = 0; i < keys.length; i++) {
    hm = hashMap.get(keys[i]);
    hashLength += hm * hm;
    if (hashMap2.containsKey(keys[i])) {
        ans += hm * hashMap2.get(keys[i]);
    }
}
By the way, is there any special reason for using a HashMap, or could you get by with some simpler kind of array?
Upvotes: -2
Reputation: 5629
Usually there are too many vectors to precompute the cosine similarity of each pair, but you could precompute the length of every vector and store it in a lookup table. This shaves a constant factor off each cosine similarity computation - in practice it saves a significant amount of time, because it avoids a lot of floating-point operations.
I'm assuming that you are not wasting memory by storing zeros in the vector.
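A minimal sketch of that precomputation, assuming each vector is keyed by a document ID (the map name normOf is only illustrative):

    // Precompute each vector's length once; look it up later instead of recomputing it.
    HashMap<Integer, Float> normOf = new HashMap<Integer, Float>();

    void storeNorm(int docId, HashMap<Integer, Float> vector) {
        float sum = 0;
        for (float v : vector.values())
            sum += v * v;
        normOf.put(docId, (float) Math.sqrt(sum));
    }

The similarity of two documents is then their dot product divided by normOf.get(id1) * normOf.get(id2), with no square roots inside the inner loop.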
Upvotes: 1