Reputation: 392
I have a DataFrame containing multiple vectors, each with 3 entries. Each row is a vector in my representation. I need to calculate the cosine similarity between every pair of these vectors. Is it better to convert this to a matrix representation, or is there a cleaner approach within the DataFrame itself?
Here is the code that I have tried.
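import pandas as pd
from scipy import spatial

# X, Y and Z hold the three components of the vectors (defined elsewhere)
df = pd.DataFrame([X, Y, Z]).T
similarities = df.values.tolist()

# Compare every row vector with every other row vector
for x in similarities:
    for y in similarities:
        # cosine similarity = 1 - cosine distance
        result = 1 - spatial.distance.cosine(x, y)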
Upvotes: 26
Views: 62088
Reputation: 31
You can import pairwise_distances from sklearn.metrics.pairwise and pass it the DataFrame whose rows you want to compare, together with metric='cosine', because the metric parameter defaults to 'euclidean'. Note that this returns cosine distances, so the cosine similarity is 1 minus each entry.
DEMO
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances
df = pd.DataFrame(np.random.randint(0, 5, (3, 5)))
df
##    0  1  2  3  4
## 0  4  2  1  3  2
## 1  3  2  0  0  1
## 2  3  3  4  2  4
pairwise_distances(df, metric='cosine')
##array([[2.22044605e-16, 1.74971353e-01, 1.59831950e-01],
## [1.74971353e-01, 0.00000000e+00, 3.08976681e-01],
## [1.59831950e-01, 3.08976681e-01, 0.00000000e+00]])
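The matrix above holds cosine distances (note the zeros on the diagonal), so if you want similarities you still have to subtract from 1. A minimal sketch of that step, reusing the values printed in the demo instead of random data; the variable names `distances` and `similarities` are only illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

# Same values as the demo frame above, hard-coded so the result is reproducible
df = pd.DataFrame([[4, 2, 1, 3, 2],
                   [3, 2, 0, 0, 1],
                   [3, 3, 4, 2, 4]])

# Cosine distance between every pair of rows
distances = pairwise_distances(df, metric='cosine')

# cosine similarity = 1 - cosine distance; keep the row labels of df
similarities = pd.DataFrame(1 - distances, index=df.index, columns=df.index)
```

Wrapping the result in a DataFrame is optional; it just keeps the original row labels on both axes.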
Upvotes: 2
Reputation: 29680
You can use sklearn.metrics.pairwise.cosine_similarity directly.
Demo
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(np.random.randint(0, 2, (3, 5)))
df
##    0  1  2  3  4
## 0  1  1  1  0  0
## 1  0  0  1  1  1
## 2  0  1  0  1  0
cosine_similarity(df)
## array([[ 1. , 0.33333333, 0.40824829],
## [ 0.33333333, 1. , 0.40824829],
## [ 0.40824829, 0.40824829, 1. ]])
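If you would rather not depend on scikit-learn, the same numbers can be reproduced with plain NumPy by scaling each row to unit length and taking a matrix product. This is only a sketch, not how cosine_similarity is implemented internally; it assumes no row is all zeros, and the names `unit_rows` and `similarity` are made up for the example:

```python
import numpy as np
import pandas as pd

# Same values as the demo frame above, hard-coded so the result is reproducible
df = pd.DataFrame([[1, 1, 1, 0, 0],
                   [0, 0, 1, 1, 1],
                   [0, 1, 0, 1, 0]])

values = df.to_numpy(dtype=float)

# Euclidean length of each row, kept as a column so it broadcasts over the row
norms = np.linalg.norm(values, axis=1, keepdims=True)

# Assumption: no all-zero rows, otherwise this divides by zero
unit_rows = values / norms

# The dot product of two unit vectors is their cosine similarity
similarity = pd.DataFrame(unit_rows @ unit_rows.T,
                          index=df.index, columns=df.index)
```

This also answers the "matrix representation" part of the question: the whole pairwise computation is a single matrix product, with no Python loop over rows.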
Upvotes: 52