Reputation: 13
I have a pandas dataframe df with many rows. For each row, I want to calculate the cosine similarity between the row's columns A (first vector) and the row's columns B (second vector). At the end, I aim to get a vector with one cosine similarity value for each row. I have found a solution, but it seems to me that it could be done much faster without this loop. Could anyone give me some feedback on this code? Thank you very much!
for row in np.unique(df.index):
    cos_sim[row] = scipy.spatial.distance.cosine(df[df.index == row][columnsA],
                                                 df[df.index == row][columnsB])
df['cos_sim'] = cos_sim
Here comes some sample data:
df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])
columnsA=['featureA1', 'featureA2']
columnsB=['featureB1', 'featureB2']
This is my desired output (cosine similarity for Pit, Mat, Tim and Sam):
cos_sim=[1, 1, 1, 1]
I am already receiving this output with my method, but I am sure the code could be improved from a performance perspective.
Upvotes: 1
Views: 1891
Reputation: 410
Pretty old post, but I am replying for future readers. I created https://github.com/ma7555/evalify for all those row-wise similarity/distance calculations (disclaimer: I am the owner of the package).
Upvotes: 0
Reputation: 2887
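Several things you can improve on :)
Use the DataFrame.apply function. pandas already offers you looping "under the hood".
df['cos_sim'] = df.apply(lambda _df: scipy.spatial.distance.cosine(_df[columnsA], _df[columnsB]), axis=1)
or something similar should be more performant. Note that scipy.spatial.distance.cosine returns the cosine distance, so use 1 minus that value if you want the similarity.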
DataFrame.loc: df[df.index==row][columnsA] and df.loc[row, columnsA] should be equivalent.
for index, row in df.iterrows():
    scipy.spatial.distance.cosine(row[columnsA], row[columnsB])
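If the loop itself is the bottleneck, one option is to skip per-row calls entirely and compute all rows at once with NumPy. The following is only a sketch, not something from the original code: the helper name rowwise_cosine_similarity is made up for illustration, and it assumes all columns are numeric and that no row is an all-zero vector (such a row would produce NaN). It returns the similarity directly, so no "1 minus distance" step is needed.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'featureA1': [2, 4, 1, 4],
                   'featureA2': [2, 4, 1, 4],
                   'featureB1': [10, 2, 1, 8],
                   'featureB2': [10, 2, 1, 8]},
                  index=['Pit', 'Mat', 'Tim', 'Sam'])
columnsA = ['featureA1', 'featureA2']
columnsB = ['featureB1', 'featureB2']


def rowwise_cosine_similarity(frame, cols_a, cols_b):
    """Hypothetical helper: cosine similarity between two column groups, per row."""
    a = frame[cols_a].to_numpy(dtype=float)
    b = frame[cols_b].to_numpy(dtype=float)
    # Row-wise dot product of a and b (one value per row).
    dot = np.einsum('ij,ij->i', a, b)
    # Product of the Euclidean norms of each row pair.
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Assumption: no zero-length vectors, otherwise this divides by zero.
    return pd.Series(dot / norms, index=frame.index)


df['cos_sim'] = rowwise_cosine_similarity(df, columnsA, columnsB)
print(df['cos_sim'])
```

Because the arithmetic runs on whole arrays, this avoids both the Python-level loop and the repeated df[df.index==row] filtering, which is where most of the time goes in the original version.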
Upvotes: 1