silent_dev

Reputation: 1616

How to quickly calculate cosine similarity for large number of vectors in Python?

I have a set of 100 thousand vectors and I need to retrieve the top-25 closest vectors based on cosine similarity.

Scipy and Sklearn have implementations for computing cosine distance/similarity between 2 vectors, but I will need to compute the cosine similarity for a 100k x 100k matrix and then take out the top-25. Is there any fast implementation in Python to compute that?
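(For reference, the two-vector calls referred to above are presumably something like the following; the vectors `a` and `b` are just placeholders:)

    import numpy
    from scipy.spatial.distance import cosine
    from sklearn.metrics.pairwise import cosine_similarity

    a = numpy.random.rand(400)
    b = numpy.random.rand(400)

    # scipy returns the cosine *distance*, i.e. 1 - similarity
    sim_scipy = 1 - cosine(a, b)

    # sklearn works on 2-D arrays and returns a pairwise similarity matrix
    sim_sklearn = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]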

As per @Silmathoron's suggestion, this is what I am doing -

import heapq

import numpy

# vectors is a list of 100K vectors, each of dimension 400,
# i.e. the resulting numpy array has shape 100K x 400
vectors = numpy.array(vectors)
similarity = numpy.dot(vectors, vectors.T)


# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

k = 26  # top-25 neighbours plus each vector's similarity with itself

box_plot_file = open("box_data.csv", "w+")

# queries holds one identifier string per row of `vectors`
for sim, query in zip(cosine, queries):
    k_largest = heapq.nlargest(k, sim)
    k_largest = map(str,k_largest)
    result = query + "," + ",".join(k_largest) + "\n"
    box_plot_file.write(result)
box_plot_file.close()
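As an aside, the per-row `heapq` pass can be replaced by a vectorised `numpy.argpartition`, and normalising the vectors up front avoids keeping the full raw dot-product matrix around. A minimal sketch, assuming `vectors` as above (the helper name `top_k_cosine` and the `chunk_size` value are illustrative, not part of any library):

    import numpy

    def top_k_cosine(vectors, k=25, chunk_size=1000):
        """Illustrative helper: k largest cosine similarities per row."""
        vectors = numpy.asarray(vectors, dtype=numpy.float64)

        # normalise rows once; replace zero norms by 1 to avoid division by zero
        norms = numpy.linalg.norm(vectors, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        unit = vectors / norms

        results = []
        for start in range(0, unit.shape[0], chunk_size):
            block = unit[start:start + chunk_size]   # chunk_size x 400
            sims = block.dot(unit.T)                 # chunk_size x 100K cosine similarities

            # indices of the k+1 largest entries per row (+1 accounts for self-similarity)
            idx = numpy.argpartition(sims, -(k + 1), axis=1)[:, -(k + 1):]
            top = numpy.take_along_axis(sims, idx, axis=1)

            # sort only the small top-(k+1) slice, in descending order
            order = numpy.argsort(-top, axis=1)
            results.append(numpy.take_along_axis(top, order, axis=1))

        return numpy.vstack(results)

Each row of the result starts with the vector's self-similarity of 1.0 (barring ties), so the remaining 25 columns are the values the loop above writes out.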

Upvotes: 8

Views: 4969

Answers (1)

ericf

Reputation: 260

I would try smarter algorithms first, rather than speeding up brute force (computing all pairs of vectors). KD-trees might work (scipy.spatial.KDTree) if your vectors are of low dimension. If they are of high dimension then you might need a random projection first: http://scikit-learn.org/stable/modules/random_projection.html
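A minimal sketch of that idea, assuming the 100K x 400 array from the question: on unit-length vectors, Euclidean distance is a monotone function of cosine similarity (||a - b||² = 2 - 2·cos(a, b)), so a KD-tree query over normalised vectors gives cosine neighbours. The 50-dimensional projection size is an arbitrary choice, and the recovered similarities are approximate because of the projection:

    import numpy
    from scipy.spatial import cKDTree
    from sklearn.random_projection import GaussianRandomProjection

    vectors = numpy.asarray(vectors, dtype=numpy.float64)   # 100K x 400

    # optional: reduce dimensionality first, as KD-trees degrade in high dimensions
    projected = GaussianRandomProjection(n_components=50).fit_transform(vectors)

    # normalise rows so Euclidean distance orders points by cosine similarity
    unit = projected / numpy.linalg.norm(projected, axis=1, keepdims=True)

    tree = cKDTree(unit)
    # 26 nearest neighbours per point: the closest one is the point itself
    dist, idx = tree.query(unit, k=26)

    # convert distances back to cosine similarities: cos = 1 - d**2 / 2
    cosine_sims = 1 - dist ** 2 / 2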

Upvotes: 3
