Reputation: 865
How calculating pairwise cosine similarities for the huge amount of vectors can be optimized(estimates suits)?
For two sets (A, B), containing vectors - pairwise cosine similarity sim(a_i, b_j) need to be generated for every a and b. (Cosine similarity matrix also suits since it's easy to transform from a matrix to pairwise.)
It looks like a common problem because of the need for calculation of such distances in computational biology, recommendation systems, etc. But I haven't managed to find some reasonable solution for this.
By definition complexity of this problem is O(len_A * len_B * O(similarity_function)) so 10^6 vectors in both A and B sets tend to huge running time
It looks like, we are doing a lot of useless work here since similarities aren't independent (if we have a similarity of a_i calculated for million vectors, and b_j very similar to a_i - and we have b_j similarity for 900k of vectors calculated we can estimate b_j similarity to rest 100k vectors). I assume that something like indexing may be used here.
Your entries are appreciated.
python==3.6
pandas==0.25.0
scikit-learn==0.21.3
numpy==1.17.1
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
df_1 = pd.DataFrame({'object_id_1': range(10),
'feature_0': np.random.uniform(0,1,10),
'feature_1': np.random.uniform(0,1,10),
'feature_2': np.random.uniform(0,1,10),
'feature_3':np.random.uniform(0,1,10)})
df_2 = pd.DataFrame({'object_id_2': range(10,20),
'feature_0': np.random.uniform(0,1,10),
'feature_1': np.random.uniform(0,1,10),
'feature_2': np.random.uniform(0,1,10),
'feature_3':np.random.uniform(0,1,10)})
def get_similarities(df_1: pd.DataFrame, df_2: pd.DataFrame, meaningful_features:list) -> pd.DataFrame:
'''
This function generates features based similarity scores, between two groups of objects
Parameters
----------
df_1: pandas.DataFrame
DataFrame with features, and id_s of objects
df_2: pandas.DataFrame
DataFrame with features, and id_s of objects which has no id_s same to df_1
meaningful_features: list
Features columns to calculate similarity on
Returns
----------
similarities_of_objects: pandas.DataFrame
DataFrame, with columns 'object_id_1', 'object_id_2', 'similarity',
where we have features similarity, for each object_1-object_2 pair.
Similarity - symmetric.
'''
objects_1 = [] # list of all objects from df_1
objects_2 = [] # list of all objects from df_2
similarities = [] # list of scores for object_1-object_2 pairs
for object_1 in df_1['object_id_1'].unique():
features_vector_1 = df_1[df_1['object_id_1'] == object_1][meaningful_features] # object_1 features vector
for object_2 in df_2['object_id_2'].unique():
features_vector_2 = df_2[df_2['object_id_2'] == object_2][meaningful_features] # object_2 features vector
objects_1.append(object_1)
objects_2.append(object_2)
similarities.append(cosine_similarity(X = np.array(features_vector_1)
,Y = np.array(features_vector_2)).item()) # similarities of vectors
sim_o1_to_o2 = pd.DataFrame()
sim_o1_to_o2['objects_1']= objects_1
sim_o1_to_o2['objects_2']= objects_2
sim_o1_to_o2['similarity']= similarities
return sim_o1_to_o2
get_similarities(df_1,df_2, ['feature_0', 'feature_1', 'feature_2'])
Upvotes: 2
Views: 3773
Reputation: 865
Here's, also alternative ways for calculating euclidian distance especialy relevant for cases, when you need only top similar vectors, not entire similarity matrix.
Here's the solution to the exact problem I've posted, using a method proposed by @Abhik Sarkar. To have a cosine similarity make sure that your vectors are previously normalised. This solution also allows you to generate as many similarities, as you want, not necessary full matrix.
Disclaimer: Solution is focused on readability, not performance.
Requirements
python==3.6
pandas==0.25.0
numpy==1.17.1
faiss==1.5.3
Generating dummy data
import pandas as pd
import numpy as np
import faiss
df_1 = pd.DataFrame({'object_id_1': range(10),
'feature_0': np.random.uniform(0,1,10),
'feature_1': np.random.uniform(0,1,10),
'feature_2': np.random.uniform(0,1,10),
'feature_3':np.random.uniform(0,1,10)})
df_2 = pd.DataFrame({'object_id_2': range(10,20),
'feature_0': np.random.uniform(0,1,10),
'feature_1': np.random.uniform(0,1,10),
'feature_2': np.random.uniform(0,1,10),
'feature_3':np.random.uniform(0,1,10)})
Function for similarities generation
def get_similarities(df_1: pd.DataFrame,
df_2: pd.DataFrame,
meaningful_features:list,
n_neighbors:int = df_2.shape[0])->pd.DataFrame:
'''
This function generates features based similarity scores, between to groups of reviews
Parameters
----------
df_1: pandas.DataFrame
DataFrame with features, and id_s of objects
df_2: pandas.DataFrame
DataFrame with features, and id_s of objects which has no id_s same to df_1
meaningful_features: list
Features columns to calculate similarity on
n_neighbors: int
Number of most similar objects_2 for every object_1. By default - full similarity matrix generated.
(default = df_2.shape[0])
Returns
----------
similarities_of_objects: pandas.DataFrame
DataFrame, with columns 'object_id_1', 'object_id_2', 'similarity',
where we have features similarity, for each object_1-object_2 pair.
Similarity - symmetric.
'''
d = len(meaningful_features) # dimensionality
res = np.empty(shape=[1, 3]) # res initialization
xb = np.float32(df_1[meaningful_features].values)
xb = np.ascontiguousarray(xb)
xq = np.float32(df_2[meaningful_features].values)
xq = np.ascontiguousarray(xq)
index = faiss.IndexFlatL2(d) # build the index
index.add(xb) # add vectors to the index
D, I = index.search(xq, n_neighbors) # actual search
for i in range(I.shape[0]):
object_id_1_v = [df_1["object_id_1"].iloc[i]]*n_neighbors
object_id_2_v = df_2["object_id_2"].iloc[I[i]]
similarities = 1-D[i]/2
neighbors_scores_for_target = np.stack((object_id_1_v, object_id_2_v, similarities), axis=-1)
res = np.concatenate((res, neighbors_scores_for_target))
res = res[1:] # remove line we've created during res initialization
resulting_df = pd.DataFrame({'object_id_1': res[:, 0],
'object_id_2': res[:, 1],
'similarity': res[:, 2] })
return resulting_df
Generate similarities
get_similarities(df_1,df_2, ['feature_0', 'feature_1', 'feature_2'])
Upvotes: 2
Reputation: 965
Use Faiss
import faiss
dimension = 100
value1 = np.random.random((n, dimension)).astype('float32')
index = faiss.IndexFlatL2(d)
index.add(value1)
xq = value2
k= len(value1)
D, I = index.search(xq, k)
Note that here D is the distance, and I is the Index of the values.
Also, value1 and value2 are nothing but NumPy arrays.
PS: Install faiss first.
pip install faiss
Upvotes: 3