SteveS

Reputation: 4040

Calculate similarity of 1-row dataframe and a large dataframe with the same columns in Python?

I have a very large dataframe (millions of rows), and each time I receive a 1-row dataframe with the same columns. For example:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, -1], 'c': [-1, 0.4, 31]})
input = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))

I would like to calculate the cosine similarity between input and every row of df. I am currently using the following:

from scipy.spatial.distance import cosine
df.apply(lambda row: 1 - cosine(row, input.iloc[0]), axis=1)

But it's a bit slow. I tried the swifter package, and it seems to run faster. What is the best practice for such a task: keep this approach, or switch to another method?
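For reference, the swifter attempt mentioned above presumably looked roughly like this (a sketch, since the exact call isn't shown here; it assumes swifter is installed):

import swifter  # registers the .swifter accessor on DataFrames

# Same row-wise cosine similarity, dispatched through swifter,
# which parallelizes the apply where it can.
similarities = df.swifter.apply(lambda row: 1 - cosine(row, input.iloc[0]), axis=1)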

Upvotes: 0

Views: 362

Answers (1)

Raymond Kwok

Reputation: 2541

I usually don't do matrix manipulation with a DataFrame but with a numpy.array, so I will first convert them:

import numpy as np

df_npy = df.values
input_npy = input.values

Then, since I don't want to use scipy.spatial.distance.cosine, I will take care of the calculation myself, which means first normalizing each of the vectors:

df_npy = df_npy / np.linalg.norm(df_npy, axis=1, keepdims=True)
input_npy = input_npy / np.linalg.norm(input_npy, axis=1, keepdims=True)

And then matrix multiply them together

df_npy @ input_npy.T

which will give you

array([[0.213],
       [0.524],
       [0.431]])
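If you want the result aligned back to the DataFrame's index as a pandas Series (a small optional step I'm adding, not part of the original answer), you can flatten the column vector:

# Wrap the (n, 1) similarity matrix into a Series indexed like df.
similarities = pd.Series((df_npy @ input_npy.T).ravel(), index=df.index)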

The reason I don't want to use scipy.spatial.distance.cosine is that it only handles one pair of vectors at a time, whereas the approach shown above handles all rows in a single vectorized operation.
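As an alternative not covered in the original answer, if scikit-learn happens to be available, the same (n, 1) result can be obtained in a single call, with the normalization handled internally:

from sklearn.metrics.pairwise import cosine_similarity

# Returns an (n_rows_of_df, 1) array of cosine similarities,
# matching the normalized matrix product above.
sims = cosine_similarity(df.values, input.values)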

Upvotes: 1
