Jamik

Reputation: 85

Cosine similarity is slow

I have a set of sentences that are encoded into vectors with a sentence encoder, and I want to find the sentence most similar to an incoming query.

The search function looks as follows:

def semantic_search(cleaned_query, data, vectors):
    # Encode the query into a single 500-dimensional vector.
    query_vec = get_features(cleaned_query)[0].ravel()
    res = []
    for i, d in enumerate(data):
        qvec = vectors[i].ravel()
        sim = cosine_similarity(query_vec, qvec)
        if sim > 0.5:
            res.append((format(sim * 100, '.2f'), d))
    # The stored scores are strings, so compare them as floats when sorting.
    return sorted(res, key=lambda x: float(x[0]), reverse=True)[:15]

where cleaned_query is the preprocessed query as a string, data is a list of all sentences (300 in total), and vectors contains the encoded vector for each sentence in data, with shape (300, 500).

When I send a query to my service, it takes around 10-12 seconds to process, which is too slow in my opinion. I have done some debugging and realized that the issue is in the cosine_similarity function, which is implemented as follows:

import numpy as np

def cosine_similarity(v1, v2):
    mag1 = np.linalg.norm(v1)
    mag2 = np.linalg.norm(v2)
    # Guard against division by zero for all-zero vectors.
    if (not mag1) or (not mag2):
        return 0
    return np.dot(v1, v2) / (mag1 * mag2)

I have looked into different implementations and found one using numba, nb_cosine, that runs quite fast, but it does not deliver good results; the cosine_similarity above gives more correct and meaningful results. Here is the numba implementation:

import numba as nb
import numpy as np

@nb.jit(nopython=True, fastmath=True)
def nb_cosine(x, y):
    xx, yy, xy = 0.0, 0.0, 0.0
    for i in range(len(x)):
        xx += x[i] * x[i]
        yy += y[i] * y[i]
        xy += x[i] * y[i]
    # Note: this returns 1 - similarity, i.e. the cosine *distance*.
    return 1.0 - xy / np.sqrt(xx * yy)

Can anyone suggest how I can optimize my cosine_similarity function to run faster? The 300 sentences are always the same. And just in case it's needed, below is the get_features function:

def get_features(texts):
    if isinstance(texts, str):
        texts = [texts]
    # Run the embedding graph in a TensorFlow session to encode the texts.
    with tf.Session(graph=graph) as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        return sess.run(embed(texts))

Upvotes: 2

Views: 772

Answers (1)

Tom Zych

Reputation: 13596

I’m not sure if you’re calculating the cosine similarity correctly there; you may want to check some values you’re getting and make sure they make sense.
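For instance, a quick check on random vectors (a minimal sketch reusing the two functions from the question) suggests what is going on: nb_cosine returns 1.0 - xy / np.sqrt(xx * yy), which is the cosine distance, not the similarity, so the two functions will disagree by construction:

import numpy as np

# Sanity check: compare the two implementations on the same random inputs.
v1 = np.random.rand(500)
v2 = np.random.rand(500)

sim = cosine_similarity(v1, v2)   # cosine similarity, roughly 0.75 here
dist = nb_cosine(v1, v2)          # cosine distance, i.e. 1 - similarity

print(sim, dist, sim + dist)      # sim + dist should print roughly 1.0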

Anyway, one way to speed things up would be to precalculate and store the magnitude of each vector for your 300 sentences, and also precalculate the magnitude of query_vec. As the code is now, you’re recalculating the magnitude of each sentence with every call, and calculating the magnitude of query_vec 300 times.
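Here is a minimal sketch of that idea, which also vectorizes the loop with NumPy (semantic_search_fast and unit_matrix are names I'm introducing here, not from the question): normalize the 300 sentence vectors once at startup, and then each query needs only one matrix-vector product instead of 300 Python-level calls:

import numpy as np

# One-time setup: stack the sentence vectors into a (300, 500) matrix
# and divide each row by its magnitude, so every row has norm 1.
matrix = np.asarray(vectors).reshape(len(data), -1)
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
norms[norms == 0] = 1.0                # avoid dividing all-zero rows by zero
unit_matrix = matrix / norms

def semantic_search_fast(cleaned_query, data, unit_matrix):
    query_vec = get_features(cleaned_query)[0].ravel()
    qnorm = np.linalg.norm(query_vec)
    if not qnorm:
        return []
    # One matrix-vector product yields all 300 cosine similarities at once.
    sims = unit_matrix @ (query_vec / qnorm)
    res = [(format(s * 100, '.2f'), data[i])
           for i, s in enumerate(sims) if s > 0.5]
    return sorted(res, key=lambda x: float(x[0]), reverse=True)[:15]

With the magnitudes baked in, each query costs a single (300, 500) by (500,) multiplication, and the per-pair cosine_similarity calls, along with their repeated norm computations, disappear entirely.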

Upvotes: 0
