row-wise calculation of cosine similarity in pandas without looping

Question

I have a pandas DataFrame df with many rows. For each row, I want to calculate the cosine similarity between the row's columns A (first vector) and the row's columns B (second vector). In the end, I want a vector with one cosine similarity value per row. I have found a solution, but it seems to me that it could be done much faster without this loop. Could anyone give me some feedback on this code? Thank you very much!


for row in np.unique(df.index):
    cos_sim[row] = scipy.spatial.distance.cosine(df[df.index==row][columnsA],
                                                 df[df.index==row][columnsB])

df['cos_sim']=cos_sim

Here is some sample data:

df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])

columnsA=['featureA1', 'featureA2']
columnsB=['featureB1', 'featureB2']

This is my desired output (cosine similarity for Pit, Mat, Tim and Sam):

cos_sim=[1, 1, 1, 1]

I already get this output with my method, but I am sure the code could be improved from a performance perspective.

maow · Accepted Answer

There are several things you can improve on :)

Take a look at the DataFrame.apply function. pandas already offers you looping "under the hood".

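# scipy's cosine() returns the cosine distance, so subtract it from 1 to get the similarity
df['cos_sim'] = df.apply(lambda row: 1 - scipy.spatial.distance.cosine(row[columnsA], row[columnsB]), axis=1)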

or something similar should be more performant than filtering the whole DataFrame once per row.
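Note that apply with axis=1 still calls the Python function once per row. If that is too slow, the computation can also be written with whole-array NumPy operations. The following is only a minimal sketch, assuming the sample data and the columnsA/columnsB lists from the question; the helper name rowwise_cosine_similarity is made up for illustration and is not part of pandas or SciPy.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])
columnsA = ['featureA1', 'featureA2']
columnsB = ['featureB1', 'featureB2']


def rowwise_cosine_similarity(frame, cols_a, cols_b):
    """Hypothetical helper: cosine similarity between two column groups, per row."""
    a = frame[cols_a].to_numpy(dtype=float)
    b = frame[cols_b].to_numpy(dtype=float)
    # Row-wise dot product and product of the row-wise Euclidean norms
    dot = (a * b).sum(axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # A row containing an all-zero vector yields NaN (0 / 0) here
    return dot / norms


df['cos_sim'] = rowwise_cosine_similarity(df, columnsA, columnsB)
```

Both column groups must have the same number of columns for the element-wise product to line up.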

Also take a look at DataFrame.loc

df[df.index==row][columnsA]

and

df.loc[row,columnsA]

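should select the same values (with unique index labels, the .loc form returns them as a Series rather than as a one-row DataFrame)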
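To see what each of the two selections returns, here is a small sketch using the sample data from the question; the variable names masked and by_label are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])
columnsA = ['featureA1', 'featureA2']

row = 'Mat'
masked = df[df.index == row][columnsA]  # boolean mask: one-row DataFrame
by_label = df.loc[row, columnsA]        # label lookup: Series (labels are unique here)

print(masked.shape)    # (1, 2)
print(by_label.shape)  # (2,)
```

The label lookup also avoids building a boolean mask over the whole index for every row.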

If you really have to iterate over the DataFrame (which should be avoided because of the performance penalty and because it is harder to read and understand), pandas gives you a generator over the rows that yields (index, row) pairs:

for index, row in df.iterrows():
    scipy.spatial.distance.cosine(row[columnsA], row[columnsB])
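The loop above computes the values but does not store them. One possible way to collect them into a new column is sketched below, assuming the sample data from the question and a unique index; the dictionary name similarities is illustrative. Since scipy.spatial.distance.cosine returns the cosine distance, the sketch subtracts it from 1 to obtain the similarity.

```python
import pandas as pd
from scipy.spatial import distance

# Sample data from the question
df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])
columnsA = ['featureA1', 'featureA2']
columnsB = ['featureB1', 'featureB2']

similarities = {}
for index, row in df.iterrows():
    # distance.cosine gives the cosine distance; 1 - distance is the similarity
    similarities[index] = 1 - distance.cosine(row[columnsA], row[columnsB])

# Assumes unique index labels; the Series is aligned to df by those labels
df['cos_sim'] = pd.Series(similarities)
```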

Finally, to get better answers on Stack Overflow, always provide a concrete example in which the problem is reproducible. Otherwise it is much harder to interpret the question correctly and to test a solution.
