Reputation: 139
I have a 300,000-row pd.DataFrame with multiple columns, one of which holds a 50-dimensional numpy array of shape (1, 50), like so:
ID Array1
1 [2.4252 ... 5.6363]
2 [3.1242 ... 9.0091]
3 [6.6775 ... 12.958]
...
300000 [0.1260 ... 5.3323]
I then generate a new numpy array (let's call it array2) with the same shape and calculate the cosine similarity between each row of the dataframe and the generated array. For this, I am currently using sklearn.metrics.pairwise.cosine_similarity and saving the results in a new column:
from sklearn.metrics.pairwise import cosine_similarity
df['Cosine'] = cosine_similarity(df['Array1'].tolist(), array2)
Which works as intended and takes, on average, 2.5 seconds to execute. I am currently trying to lower this time to under 1 second simply for the sake of having less waiting time in the system I am building.
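For reference, the same computation can be written directly with NumPy; this is a minimal sketch, assuming the Array1 column can be stacked into a single (300000, 50) matrix and array2 is the (1, 50) query described above:
import numpy as np
# Stack the per-row (1, 50) arrays into one contiguous (300000, 50) matrix
matrix = np.vstack(df['Array1'].to_numpy())
query = array2.ravel()  # flatten the (1, 50) query to shape (50,)
# Cosine similarity = dot product divided by the product of the norms
df['Cosine'] = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))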
I am beginning to learn about Vaex and Dask as alternatives to pandas, but am failing to convert the code I provided into a working equivalent that is also faster.
Preferably with one of the technologies I mentioned, how can I go about making pairwise cosine calculations even faster for large datasets?
Upvotes: 1
Views: 1072
Reputation: 90
You could use Faiss here and apply a knn operation. To do this, you would put the dataframe's vectors into a Faiss index and then search it with the generated array, using k=300000 (or whatever the total number of rows of your dataframe is).
import faiss
import numpy as np

dimension = 50  # the vectors in the question are 50-dimensional

# Stack the dataframe column into a single contiguous float32 matrix
vectors = np.vstack(df['Array1'].to_numpy()).astype('float32')
query = array2.astype('float32').reshape(1, dimension)

# Build an inner-product index and add all rows at once
index = faiss.IndexFlatIP(dimension)
index.add(vectors)

# Search with k equal to the number of rows so every vector gets scored
k = len(df)
D, I = index.search(query, k)  # D: scores, I: row indices
Note that you'll need to L2-normalise the vectors to make this work (the index above is based on the inner product, which only equals cosine similarity for unit-length vectors).
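A minimal sketch of that normalisation step, reusing the vectors and query arrays from the snippet above (faiss.normalize_L2 normalises the rows of a float32 matrix in place):
# Normalise before adding/searching so that inner product == cosine similarity
faiss.normalize_L2(vectors)  # the (300000, 50) float32 matrix
faiss.normalize_L2(query)    # the (1, 50) float32 query
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
D, I = index.search(query, len(df))  # D now holds cosine similarities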
Upvotes: 1