Johnny

Reputation: 139

Most efficient way of computing pairwise cosine similarity for large DataFrame

I have a 300,000-row pd.DataFrame with multiple columns, one of which holds a 50-dimensional numpy array of shape (1, 50) per row, like so:

ID                     Array1    
1         [2.4252 ... 5.6363] 
2         [3.1242 ... 9.0091] 
3         [6.6775 ... 12.958]  
...
300000    [0.1260 ... 5.3323]    

I then generate a new numpy array (let's call it array2) with the same shape and calculate the cosine similarity between each row of the dataframe and the generated array. For this, I am currently using sklearn.metrics.pairwise.cosine_similarity and save the results in a new column:

from sklearn.metrics.pairwise import cosine_similarity
df['Cosine'] = cosine_similarity(df['Array1'].tolist(), array2)

This works as intended and takes, on average, 2.5 seconds to execute. I am trying to lower this to under 1 second, simply to reduce waiting time in the system I am building.
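For reference, the same computation can be written in plain NumPy by stacking the column into one matrix and taking a normalised dot product (a minimal sketch, assuming array2 has shape (1, 50) as described above):

import numpy as np

# Stack the per-row (1, 50) arrays into one (300000, 50) matrix
matrix = np.vstack(df['Array1'].values)

# Cosine similarity is the dot product of L2-normalised vectors
matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
query_norm = array2 / np.linalg.norm(array2)
df['Cosine'] = matrix_norm @ query_norm.ravel()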

I am beginning to learn about Vaex and Dask as alternatives to pandas, but I am failing to convert the code above into a working equivalent that is also faster.

Preferably with one of the technologies I mentioned, how can I go about making pairwise cosine calculations even faster for large datasets?

Upvotes: 1

Views: 1072

Answers (1)

greenstreets2

Reputation: 90

You could use Faiss here and apply a knn operation. To do this, you would put the dataframe's vectors into a Faiss index and then search it with your query array, using k=300000 (or whatever the total number of rows of your dataframe is).

import faiss
import numpy as np

dimension = 50  # each vector in the dataframe has 50 components

# Stack the rows of the dataframe into one contiguous float32 matrix
vectors = np.vstack(df['Array1'].values).astype('float32')
query = array2.astype('float32').reshape(1, dimension)

# Add the rows of the dataframe into Faiss
faiss_index = faiss.IndexFlatIP(dimension)
faiss_index.add(vectors)

k = len(df)  # search all rows to get the full similarity list
D, I = faiss_index.search(query, k)  # D: inner products, I: row indices

Note that you'll need to normalise the vectors to make this work, since the above is based on inner product (the inner product of two unit vectors equals their cosine similarity).
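A minimal sketch of that normalisation step, using Faiss's built-in helper on the float32 arrays from above (call it before faiss_index.add and faiss_index.search):

# Normalise rows in place so that inner product == cosine similarity
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)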

Upvotes: 1
