Lea

Reputation: 13

row-wise calculation of cosine similarity in pandas without looping

I have a pandas DataFrame df with many rows. For each row, I want to calculate the cosine similarity between the row's columns A (first vector) and the row's columns B (second vector). In the end, I want a vector with one cosine similarity value per row. I have found a solution, but it seems to me that it could be done much faster without this loop. Could anyone give me some feedback on this code? Thank you very much!


for row in np.unique(df.index):
    cos_sim[row] = scipy.spatial.distance.cosine(df[df.index == row][columnsA],
                                                 df[df.index == row][columnsB])

df['cos_sim']=cos_sim

Here is some sample data:

df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])

columnsA=['featureA1', 'featureA2']
columnsB=['featureB1', 'featureB2']

This is my desired output (cosine similarity for Pit, Mat, Tim and Sam):

cos_sim=[1, 1, 1, 1]

I am already getting this output with my method, but I am sure the code could be improved from a performance perspective.

Upvotes: 1

Views: 1891

Answers (2)

ma7555

Reputation: 410

This is a pretty old post, but I am replying for future readers. I created https://github.com/ma7555/evalify for all those row-wise similarity/distance calculations (disclaimer: I am the owner of the package).

Upvotes: 0

maow

Reputation: 2887

There are several things you can improve on :)

  1. Take a look at the DataFrame.apply function. pandas already offers you looping "under the hood". With axis=1 the function is called once per row:
df['cos_sim'] = df.apply(lambda row: scipy.spatial.distance.cosine(row[columnsA], row[columnsB]), axis=1)

or something similar should be faster than the original loop, because it avoids filtering the whole DataFrame once per row.

  2. Also take a look at DataFrame.loc
df[df.index==row][columnsA]

and

df.loc[row,columnsA]

select the same data when the index is unique (the latter returns a Series rather than a one-row DataFrame).

  3. If you really have to iterate over the DataFrame (this should be avoided because of the performance penalty, and it is harder to read and understand), pandas gives you a generator that yields (index, row) pairs:
for index, row in df.iterrows():
    scipy.spatial.distance.cosine(row[columnsA], row[columnsB])
  4. Finally, as mentioned above, to get better answers on Stack Overflow, always provide a concrete example where the problem is reproducible. Otherwise it is much harder to interpret the question correctly and to test a solution.
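
For completeness, the per-row loop can probably be dropped altogether by working on the underlying NumPy arrays. The following is only a minimal sketch, assuming the sample data and the columnsA / columnsB lists from the question; the names a, b, dot and norms are introduced here purely for illustration. Note also that scipy.spatial.distance.cosine computes the cosine distance (1 minus the similarity), whereas this sketch computes the similarity directly.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])
columnsA = ['featureA1', 'featureA2']
columnsB = ['featureB1', 'featureB2']

# Pull both column groups out as 2-D float arrays (one row per DataFrame row)
a = df[columnsA].to_numpy(dtype=float)
b = df[columnsB].to_numpy(dtype=float)

# Row-wise dot product and product of the row-wise Euclidean norms
dot = np.einsum('ij,ij->i', a, b)
norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)

# Cosine similarity per row; a row containing an all-zero vector would give NaN
df['cos_sim'] = dot / norms
```

On the sample data every A vector is parallel to its B vector, so each similarity comes out as 1 up to floating-point rounding. If you need the distance that scipy returns instead, use 1 - df['cos_sim'].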

Upvotes: 1
