Jimmy C

Reputation: 9680

Efficient cosine distance calculation

I want to find the nearest neighbors of a vector, by cosine distance, among the rows of a matrix, and have been testing the performance of a few Python functions for doing this.

import numpy as np
import pandas as pd
import scipy.spatial.distance

def cos_loop_spatial(matrix, vector):
    """
    Calculating pairwise cosine distance using a common for loop with the scipy cosine function.
    """
    neighbors = []
    for row in range(matrix.shape[0]):
        neighbors.append(scipy.spatial.distance.cosine(vector, matrix[row,:]))
    return neighbors

def cos_loop(matrix, vector):
    """
    Calculating pairwise cosine similarity (1 - cosine distance) using a common for loop with manually calculated cosine value.
    """
    neighbors = []
    for row in range(matrix.shape[0]):
        vector_norm = np.linalg.norm(vector)
        row_norm = np.linalg.norm(matrix[row,:])
        cos_val = vector.dot(matrix[row,:]) / (vector_norm * row_norm)
        neighbors.append(cos_val)
    return neighbors

def cos_matrix_multiplication(matrix, vector):
    """
    Calculating pairwise cosine similarity (1 - cosine distance) using matrix vector multiplication.
    """
    dotted = matrix.dot(vector)
    matrix_norms = np.linalg.norm(matrix, axis=1)
    vector_norm = np.linalg.norm(vector)
    matrix_vector_norms = np.multiply(matrix_norms, vector_norm)
    neighbors = np.divide(dotted, matrix_vector_norms)
    return neighbors

cos_functions = [cos_loop_spatial, cos_loop, cos_matrix_multiplication]

# Test performance and plot the best result of each function (%timeit requires IPython/Jupyter)
mat = np.random.randn(1000,1000)
vec = np.random.randn(1000)
cos_performance = {}
for func in cos_functions:
    func_performance = %timeit -o func(mat, vec)
    cos_performance[func.__name__] = func_performance.best

pd.Series(cos_performance).plot(kind='bar')

[Figure: bar chart of the best run time of each function]

The cos_matrix_multiplication function is clearly the fastest of these, but I'm wondering whether there are any further efficiency improvements for matrix-vector cosine distance calculations.

Upvotes: 1

Views: 1570

Answers (1)

Yanshuai Cao

Reputation: 1297

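Use scipy.spatial.distance.cdist(mat, vec[np.newaxis,:], metric='cosine'). It computes the pairwise distance between every pair of vectors from the two collections, which are represented by the rows of the two input matrices. Since the second input here has a single row, the result has shape (number of rows in mat, 1) and holds cosine distances, i.e. 1 minus the cosine similarity.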
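A minimal sketch of how that call might be used, assuming the same 1000 x 1000 random matrix and random vector as in the question. The fixed seed, the variable names cos_dist, cos_sim and nearest, and the choice of k = 5 are illustrative assumptions, not part of the original code. It also cross-checks the result against the matrix-multiplication formula from the question, which yields similarity rather than distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Fixed seed chosen only so the sketch is reproducible (an assumption).
rng = np.random.default_rng(0)
mat = rng.standard_normal((1000, 1000))
vec = rng.standard_normal(1000)

# cdist expects two 2-D arrays, so the vector becomes a 1 x 1000 matrix.
# The result has shape (1000, 1); ravel() flattens it to one distance per row.
cos_dist = cdist(mat, vec[np.newaxis, :], metric='cosine').ravel()

# Cross-check against the matrix-multiplication formula from the question,
# which gives cosine similarity; cosine distance = 1 - cosine similarity.
cos_sim = mat.dot(vec) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec))
assert np.allclose(cos_dist, 1.0 - cos_sim)

# Indices of the k rows closest to vec (smallest cosine distance).
# k = 5 is an arbitrary illustrative value.
k = 5
nearest = np.argsort(cos_dist)[:k]
```

Whether this is faster than cos_matrix_multiplication depends on the SciPy version and the array sizes, so it is worth timing both on the real data.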

Upvotes: 3
