Reputation: 4040
I have a very large dataframe (millions of rows), and each time I receive a 1-row dataframe with the same columns. For example:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,-1], 'c': [-1,0.4,31]})
input = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))
I would like to calculate the cosine similarity between the input and every row of df. I am using the following:
from scipy.spatial.distance import cosine
df.apply(lambda row: 1 - cosine(row, input), axis=1)
But it's a bit slow. I tried the swifter package, and it seems to run faster. Please advise what the best practice is for such a task: keep this approach or switch to another method?
Upvotes: 0
Views: 362
Reputation: 2541
I usually don't do matrix manipulation with a DataFrame but with a numpy.array, so I will first convert them:
df_npy = df.values
input_npy = input.values
Then, instead of using scipy.spatial.distance.cosine, I will take care of the calculation myself, which is to first normalize each of the vectors:
df_npy = df_npy / np.linalg.norm(df_npy, axis=1, keepdims=True)
input_npy = input_npy / np.linalg.norm(input_npy, axis=1, keepdims=True)
And then matrix-multiply them together:
df_npy @ input_npy.T
which will give you:
array([[0.213],
[0.524],
[0.431]])
The reason I don't want to use scipy.spatial.distance.cosine is that it handles only one pair of vectors at a time, whereas the matrix product computes all the similarities at once.
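Putting the steps above together, here is a minimal end-to-end sketch using the toy df and input from the question (the 1-row frame is renamed to `query` here only to avoid shadowing the Python builtin `input`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, -1], 'c': [-1, 0.4, 31]})
# the question's `input`, renamed to avoid shadowing the builtin
query = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))

# Convert to NumPy arrays and L2-normalize each row
df_npy = df.values / np.linalg.norm(df.values, axis=1, keepdims=True)
query_npy = query.values / np.linalg.norm(query.values, axis=1, keepdims=True)

# One matrix product computes all cosine similarities at once
sims = (df_npy @ query_npy.T).ravel()
print(sims)  # approximately [0.213, 0.524, 0.431]
```

This replaces the per-row Python loop inside `df.apply` with a single vectorized BLAS call, which is where the speedup over `scipy.spatial.distance.cosine` comes from.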
Upvotes: 1