SteveS

Reputation: 4040

Calculate cosine similarity and output without duplicates?

I have the following vectors in my toy example:

import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1]
})

I have calculated the similarity matrix (excluding the id column).

from sklearn.metrics.pairwise import cosine_similarity

# Calculate the pairwise cosine similarities 
S = cosine_similarity(data.drop('id', axis=1))

T = S.tolist()
df = pd.DataFrame.from_records(T)

This gives me a matrix/dataframe with every combination, including self-similarities and duplicates. Is there an efficient way to calculate the similarities without the self-similarities (a vector is 100% similar to itself) and without duplicates (if vectors 1 and 2 have 89% similarity, I don't need the similarity of vectors 2 and 1, as it's the same)?

Upvotes: 1

Views: 854

Answers (1)

tryingtoprogram

Reputation: 21

The best solution I have found so far is to take the upper triangle above the diagonal (with numpy imported as np):

[In] S[np.triu_indices_from(S, k=1)]

[Out] array([ 0.93420158, -0.93416293,  0.99856978, -0.81303909, -0.99999999,
    0.91379242, -0.96724292, -0.91374841,  0.96727042, -0.78074903])

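This takes only the values strictly above the main diagonal (k=1 starts one step off the diagonal), so it excludes the self-similarities of 1 and, because S is symmetric, the repeated values. The result is a one-dimensional numpy array, ordered row by row: pairs (1, 2), (1, 3), and so on up to (4, 5).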
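If you also need to know which pair of ids each value belongs to, a minimal sketch along these lines may help. It reuses the index arrays from np.triu_indices_from to look up the ids; the column names id_1, id_2 and similarity are just my own choice for illustration, not something from the question.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1],
})

# Full symmetric similarity matrix, as in the question
S = cosine_similarity(data.drop('id', axis=1))

# Row and column positions of the entries strictly above the main diagonal
rows, cols = np.triu_indices_from(S, k=1)

# One row per unordered pair; column names are hypothetical
ids = data['id'].to_numpy()
pairs = pd.DataFrame({
    'id_1': ids[rows],
    'id_2': ids[cols],
    'similarity': S[rows, cols],
})

print(pairs)
```

For n vectors this gives n * (n - 1) / 2 rows (10 here), each pair appearing once with id_1 listed before id_2 in the original row order.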

Upvotes: 1
