Stas Buzuluk

Reputation: 865

Pairwise similarity/similarity matrix calculation optimization

Problem definition


Question

How can the calculation of pairwise cosine similarities for a huge number of vectors be optimized (estimates are acceptable)?

Formal definition

For two sets of vectors A and B, the pairwise cosine similarity sim(a_i, b_j) needs to be generated for every a_i and b_j. (A cosine similarity matrix also works, since it's easy to transform the matrix into pairwise form.)
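For illustration, here's a minimal sketch of the matrix-to-pairwise transformation with scikit-learn (the array sizes are arbitrary):

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

A = np.random.uniform(0, 1, size=(5, 4))   # 5 vectors a_i
B = np.random.uniform(0, 1, size=(7, 4))   # 7 vectors b_j

sim_matrix = cosine_similarity(A, B)       # shape (5, 7); sim(a_i, b_j) at [i, j]

# flatten the matrix into (i, j, similarity) rows
pairs = pd.DataFrame(sim_matrix).stack().reset_index()
pairs.columns = ['a_index', 'b_index', 'similarity']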


Why am I asking for help


This looks like a common problem, since such distances need to be calculated in computational biology, recommendation systems, etc., but I haven't managed to find a reasonable solution for it.

Problem I can't solve

By definition, the complexity of this problem is O(len_A * len_B * O(similarity_function)), so 10^6 vectors in both A and B lead to a huge running time.
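To make the scale concrete: with |A| = |B| = 10^6 there are 10^12 pairs; each cosine similarity of d-dimensional vectors costs O(d) multiply-adds, and just storing the dense result as float32 takes about 10^12 * 4 bytes ≈ 4 TB.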

My assumptions for future direction

It looks like we are doing a lot of useless work here, since the similarities aren't independent (if we have the similarities of a_i calculated against a million vectors, b_j is very similar to a_i, and the similarities of b_j are already calculated for 900k of those vectors, we can estimate b_j's similarity to the remaining 100k vectors). I assume something like indexing may be used here.



Additional details


  1. A and B do not intersect.
  2. The vectors' dimensionality has already been reduced to a minimal reasonable value.
  3. There's no need for simple for-loop optimization; the simplest loops are given only to clearly illustrate the algorithm.
  4. I'm interested in algorithms that only estimate the similarity as well, so it's OK if the result is close to the real value but not exactly equal.
  5. There's no need for parallelization.
  6. I understand that the resulting similarity matrix will be large.
  7. I'm also interested in an algorithm that returns only the top similar vectors from set B for every vector from set A (an illustrative sketch follows this list).
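As a rough illustration of item 7 (a sketch only; the array sizes and k are arbitrary): scikit-learn's NearestNeighbors with metric='cosine' returns cosine distances (1 - similarity), so the top-k most similar vectors of B for each vector of A can be read off directly. Note this is still a brute-force O(|A| * |B| * d) search, so it only shrinks the output, not the running time.

import numpy as np
from sklearn.neighbors import NearestNeighbors

A = np.random.uniform(0, 1, size=(1000, 4))   # hypothetical set A
B = np.random.uniform(0, 1, size=(1000, 4))   # hypothetical set B

k = 5
nn = NearestNeighbors(n_neighbors=k, metric='cosine')  # brute-force cosine distance
nn.fit(B)                                              # "index" set B
distances, indices = nn.kneighbors(A)                  # k nearest b_j for every a_i
top_similarities = 1 - distances                       # cosine distance -> cosine similarity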

Your input is appreciated.


Code samples


Requirements

python==3.6
pandas==0.25.0
scikit-learn==0.21.3
numpy==1.17.1

Generating dummy data

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity 

df_1 = pd.DataFrame({'object_id_1': range(10),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})

df_2 = pd.DataFrame({'object_id_2': range(10,20),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})

Function for similarities generation

def get_similarities(df_1: pd.DataFrame, df_2: pd.DataFrame, meaningful_features:list) -> pd.DataFrame:
    '''
    This function generates feature-based similarity scores between two groups of objects
    
    Parameters
    ----------
    df_1: pandas.DataFrame
        DataFrame with features and ids of objects
    df_2: pandas.DataFrame
        DataFrame with features and ids of objects; its ids do not overlap with df_1
    meaningful_features: list
        Features columns to calculate similarity on
        
    Returns
    ----------
        similarities_of_objects: pandas.DataFrame
            DataFrame with columns 'object_id_1', 'object_id_2', 'similarity', 
            holding the feature similarity for each object_1-object_2 pair. 
            Similarity is symmetric.  
    '''

    objects_1 = [] #  list of all objects from df_1
    objects_2 = [] #  list of all objects from df_2
    similarities = [] #  list of scores for object_1-object_2 pairs

    for object_1 in df_1['object_id_1'].unique():
        features_vector_1 = df_1[df_1['object_id_1'] == object_1][meaningful_features] # object_1 features vector
        
        for object_2 in df_2['object_id_2'].unique():
            features_vector_2 = df_2[df_2['object_id_2'] == object_2][meaningful_features] # object_2 features vector
            
            objects_1.append(object_1)
            objects_2.append(object_2)
            similarities.append(cosine_similarity(X = np.array(features_vector_1)
                                    ,Y = np.array(features_vector_2)).item()) # similarities of vectors 
    
    sim_o1_to_o2 = pd.DataFrame()

    sim_o1_to_o2['objects_1']= objects_1
    sim_o1_to_o2['objects_2']= objects_2
    sim_o1_to_o2['similarity']= similarities

    return sim_o1_to_o2

Generate similarities

get_similarities(df_1, df_2, ['feature_0', 'feature_1', 'feature_2'])

Upvotes: 2

Views: 3773

Answers (2)

Stas Buzuluk

Reputation: 865

How to get cosine similarity from Euclidean distance


For L2-normalised vectors a and b:

||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a · b) = 2 - 2 * cos(a, b)

so

cos(a, b) = 1 - ||a - b||^2 / 2

which is exactly the 1 - D/2 conversion used in the code below (faiss's IndexFlatL2 returns squared L2 distances).
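A quick numeric check of the identity above (a standalone sketch):

import numpy as np

a = np.random.uniform(0, 1, 4)
b = np.random.uniform(0, 1, 4)
a /= np.linalg.norm(a)   # L2-normalise both vectors
b /= np.linalg.norm(b)

cos_sim = a @ b                                  # cosine similarity of unit vectors
from_euclidean = 1 - np.sum((a - b) ** 2) / 2    # 1 - squared L2 distance / 2
assert np.isclose(cos_sim, from_euclidean)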

For top similar vectors only


There are also alternative ways of calculating the Euclidean distance, especially relevant for cases when you need only the top similar vectors rather than the entire similarity matrix.
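For example (an illustrative sketch; SciPy isn't in the requirements below, and KD-trees degrade in high dimensions, so this mainly helps after dimensionality reduction), scipy.spatial.cKDTree returns only the k nearest neighbours by Euclidean distance, which for L2-normalised vectors are exactly the k most cosine-similar vectors:

import numpy as np
from scipy.spatial import cKDTree

A = np.random.uniform(0, 1, size=(1000, 4))   # hypothetical query set
B = np.random.uniform(0, 1, size=(1000, 4))   # hypothetical database set

# L2-normalise so Euclidean nearness matches cosine nearness
A /= np.linalg.norm(A, axis=1, keepdims=True)
B /= np.linalg.norm(B, axis=1, keepdims=True)

tree = cKDTree(B)
distances, indices = tree.query(A, k=5)       # 5 nearest B-vectors per A-vector
similarities = 1 - distances**2 / 2           # convert Euclidean distance back to cosine similarity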

Solution with the method proposed by @Abhik Sarkar


Here's the solution to the exact problem I've posted, using the method proposed by @Abhik Sarkar. To get cosine similarity, make sure your vectors are L2-normalised beforehand (a normalisation sketch follows the dummy-data block below). This solution also allows you to generate as many similarities as you want, not necessarily the full matrix.

Disclaimer: the solution is focused on readability, not performance.

Requirements

python==3.6
pandas==0.25.0
numpy==1.17.1
faiss==1.5.3 

Generating dummy data

import pandas as pd
import numpy as np
import faiss 

df_1 = pd.DataFrame({'object_id_1': range(10),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})

df_2 = pd.DataFrame({'object_id_2': range(10,20),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})
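As noted above, the 1 - D/2 conversion inside the function yields true cosine similarities only for L2-normalised vectors. A minimal normalisation sketch (it normalises exactly the columns later passed to get_similarities, and assumes no all-zero rows):

meaningful_features = ['feature_0', 'feature_1', 'feature_2']

for df in (df_1, df_2):
    norms = np.linalg.norm(df[meaningful_features].values, axis=1, keepdims=True)
    df[meaningful_features] = df[meaningful_features].values / norms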

Function for similarities generation

def get_similarities(df_1: pd.DataFrame, 
                     df_2: pd.DataFrame, 
                     meaningful_features: list, 
                     n_neighbors: int = None) -> pd.DataFrame:
    '''
    This function generates feature-based similarity scores between two groups of objects
    
    Parameters
    ----------
    df_1: pandas.DataFrame
        DataFrame with features and ids of objects
    df_2: pandas.DataFrame
        DataFrame with features and ids of objects; its ids do not overlap with df_1
    meaningful_features: list
        Features columns to calculate similarity on
    n_neighbors: int
        Number of most similar objects_2 to return for every object_1. By default the full 
        similarity matrix is generated (n_neighbors = df_2.shape[0]).
    
    Returns
    ----------
        similarities_of_objects: pandas.DataFrame
            DataFrame with columns 'object_id_1', 'object_id_2', 'similarity', 
            holding the feature similarity for each object_1-object_2 pair. 
            Similarity is symmetric.  
    '''
    d = len(meaningful_features)  #  dimensionality
    
    if n_neighbors is None:
        n_neighbors = df_2.shape[0]  #  full similarity matrix by default
    
    res = np.empty(shape=[1, 3])  #  res initialization (dummy row, removed at the end)
    
    xb = np.float32(df_2[meaningful_features].values)  #  database vectors (set B)
    xb = np.ascontiguousarray(xb)
    
    xq = np.float32(df_1[meaningful_features].values)  #  query vectors (set A)
    xq = np.ascontiguousarray(xq)

    index = faiss.IndexFlatL2(d)  #  build the index over the df_2 vectors
    index.add(xb)                 #  add vectors to the index
    
    D, I = index.search(xq, n_neighbors)  #  squared L2 distances and df_2 row indices, one row per df_1 vector
    
    for i in range(I.shape[0]):  #  one df_1 (query) vector per iteration
        object_id_1_v = [df_1["object_id_1"].iloc[i]]*n_neighbors  #  repeat the df_1 id for each neighbour
        object_id_2_v = df_2["object_id_2"].iloc[I[i]]             #  ids of the retrieved df_2 neighbours
        similarities = 1 - D[i]/2  #  for unit vectors: cosine = 1 - squared L2 distance / 2
        
        neighbors_scores_for_target = np.stack((object_id_1_v, object_id_2_v, similarities), axis=-1)
        res = np.concatenate((res, neighbors_scores_for_target))
        
    res = res[1:] #  remove line we've created during res initialization
    
    resulting_df = pd.DataFrame({'object_id_1': res[:, 0], 
                                 'object_id_2': res[:, 1],
                                 'similarity':  res[:, 2] })

    
    return resulting_df

Generate similarities

get_similarities(df_1, df_2, ['feature_0', 'feature_1', 'feature_2'])
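The n_neighbors argument can also be used to keep only the top matches, e.g. the 5 most similar objects from df_2 for every object in df_1:

get_similarities(df_1, df_2, ['feature_0', 'feature_1', 'feature_2'], n_neighbors=5)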

Upvotes: 2

Abhik Sarkar

Reputation: 965

Use Faiss

import numpy as np
import faiss

n = 1000           # number of vectors (example value)
dimension = 100    # vector dimensionality

value1 = np.random.random((n, dimension)).astype('float32')  # database vectors
value2 = np.random.random((n, dimension)).astype('float32')  # query vectors

index = faiss.IndexFlatL2(dimension)  # exact L2 index
index.add(value1)                     # add database vectors

xq = value2
k = len(value1)                       # return all neighbours (full distance matrix)
D, I = index.search(xq, k)            # D: squared L2 distances, I: indices into value1

Note that here D holds the (squared L2) distances and I the indices of the matched values.

Also, value1 and value2 are nothing but NumPy arrays.
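Alternatively (a sketch reusing value1, value2, k and dimension from above), faiss can return cosine similarities directly if you L2-normalise the vectors and use an inner-product index:

index_ip = faiss.IndexFlatIP(dimension)  # exact inner-product index
faiss.normalize_L2(value1)               # in-place L2 normalisation (float32, C-contiguous)
faiss.normalize_L2(value2)
index_ip.add(value1)
S, I = index_ip.search(value2, k)        # S holds cosine similarities, sorted descending per row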

PS: Install faiss first.

pip install faiss

Upvotes: 3
