intodarkmoon

Reputation: 1

Calculate Cosine Similarity Sentences ValueError: Expected 2D array, got 1D array instead

I'm computing cosine similarity on a list of sentences. I've already computed the embeddings.

Here's the embedding

The shape of embedding (11, 3072)
[[-0.02179624 -0.17235152 -0.14017016 ...  0.33180898  0.13701975
  -0.2275123 ]
 [ 0.08176168  0.03396776 -0.00361721 ... -0.06099782 -0.1941497
   0.16414282]
 [ 0.01786027 -0.07074962  0.08268858 ... -0.15433213  0.22098969
  -0.05902294]
 ...
 [-0.33807683  0.06110802  0.32764304 ...  0.07062552 -0.2734855
  -0.01919978]
 [-0.09536518  0.04956777  0.64503926 ... -0.11085486 -0.36796266
   0.2826454 ]
 [-0.12355942 -0.1552269  -0.01554828 ... -0.14761439  0.17142747
  -0.02176587]]

and here are the sentences:

document1 = ["sentence a", "sentence b", "sentence c", ...] # There are 11 sentences

I tried to calculate the similarity between every pair of sentences using cosine similarity:

# Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
sentences_2d = np.array(document1).reshape(-1,1)
similarity_matrix = np.zeros([len(document1), len(document1)])
for i in range(len(sentences_2d)):
  for j in range(len(sentences_2d)):
    if i != j:
      similarity_matrix[i][j] = cosine_similarity(arrcatembed[i], arrcatembed[j])

When I run the similarity calculation, I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-e15cce98d633> in <module>
      6   for j in range(len(sentences_2d)):
      7     if i != j:
----> 8       similarity_matrix[i][j] = cosine_similarity(arrcatembed[i], arrcatembed[j])

2 frames
/usr/local/lib/python3.9/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    900             # If input is 1D raise error
    901             if array.ndim == 1:
--> 902                 raise ValueError(
    903                     "Expected 2D array, got 1D array instead:\narray={}.\n"
    904                     "Reshape your data either using array.reshape(-1, 1) if "

ValueError: Expected 2D array, got 1D array instead:
array=[-0.02179624 -0.17235152 -0.14017016 ...  0.33180898  0.13701975
 -0.2275123 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Can anyone help me solve this problem? Thank you.

I want a similarity score between each sentence and every other sentence in the list: the first sentence against the second through eleventh, the second against the first and the third through eleventh, and so on, as I have already done with cosine distance:

The shape (11, 11)
The length 11
[[1.         0.90366799 0.92140669 0.90678644 0.88496917 0.89278495
  0.93188739 0.87325549 0.88947386 0.86656564 0.90396279]
 [0.90366799 1.         0.91544878 0.95543408 0.93818021 0.94250894
  0.93432641 0.93418741 0.92931563 0.9156481  0.91719031]
 [0.92140669 0.91544878 1.         0.92346388 0.91356987 0.93290257
  0.94972414 0.90773791 0.92120057 0.90897304 0.92319667]
 [0.90678644 0.95543408 0.92346388 1.         0.94258463 0.95669407
  0.94972783 0.93550926 0.93902498 0.93075407 0.92586052]
 [0.88496917 0.93818021 0.91356987 0.94258463 1.         0.95144665
  0.92863572 0.95595235 0.9522922  0.94791383 0.94201249]
 [0.89278495 0.94250894 0.93290257 0.95669407 0.95144665 1.
  0.95301741 0.95989478 0.95237011 0.94007719 0.93626297]
 [0.93188739 0.93432641 0.94972414 0.94972783 0.92863572 0.95301741
  1.         0.92727625 0.93515086 0.92043686 0.92175251]
 [0.87325549 0.93418741 0.90773791 0.93550926 0.95595235 0.95989478
  0.92727625 1.         0.96572489 0.95371407 0.92973185]
 [0.88947386 0.92931563 0.92120057 0.93902498 0.9522922  0.95237011
  0.93515086 0.96572489 1.         0.95132333 0.9478088 ]
 [0.86656564 0.9156481  0.90897304 0.93075407 0.94791383 0.94007719
  0.92043686 0.95371407 0.95132333 1.         0.92758161]
 [0.90396279 0.91719031 0.92319667 0.92586052 0.94201249 0.93626297
  0.92175251 0.92973185 0.9478088  0.92758161 1.        ]]

Upvotes: 0

Views: 472

Answers (2)

maciek97x

Reputation: 7360

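cosine_similarity expects input of shape (n_samples, n_features) and returns a 2D array of shape (n_samples, n_samples). The error comes from passing single rows such as arrcatembed[i], which are 1D. You don't need the nested loop at all, because one call already computes every pair.

Your code should look like this, where embeddings is your (11, 3072) array (arrcatembed in your code):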

similarity_matrix = cosine_similarity(embeddings)
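
For a fuller picture, here is a minimal sketch of both options: the single call on the whole array, and the reshape that the error message suggests if you really do want one pair at a time. The embeddings array below is random stand-in data with the same shape as in the question (an assumption, not the real embeddings), and single_value is just an illustrative name.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for the (11, 3072) embedding array from the question (random data, assumption)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(11, 3072))

# One call computes every pairwise similarity; the result has shape (11, 11)
similarity_matrix = cosine_similarity(embeddings)

# For a single pair, each 1D row has to be reshaped to (1, n_features) first
pair = cosine_similarity(embeddings[0].reshape(1, -1), embeddings[1].reshape(1, -1))
single_value = pair[0, 0]  # same number as similarity_matrix[0, 1]
```

Note that the full matrix has 1.0 on the diagonal (each sentence compared with itself), whereas the original loop skipped i == j and left zeros there.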

Upvotes: 0

CutePoison

Reputation: 5385

What I would do, if I were you (which should speed it up), is:

  1. Normalize your sentence vectors so that they have unit norm
  2. Calculate the matrix product of the normalized vectors with their transpose. With one sentence per row, as in your (11, 3072) array, that is sentences@sentences.T. The result is an n x n matrix where entry (i,j) is the similarity between sentences i and j
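
The two steps above can be sketched with plain NumPy. This is only an illustration under a few assumptions: sentences is random stand-in data shaped like the question's embeddings, with one sentence per row, and no row is all zeros (otherwise the normalization would divide by zero). With that row layout the product is written unit @ unit.T, which gives an n x n result.

```python
import numpy as np

# Stand-in for the (11, 3072) embeddings, one sentence per row (random data, assumption)
rng = np.random.default_rng(0)
sentences = rng.normal(size=(11, 3072))

# Step 1: scale every row to unit L2 norm (assumes no all-zero rows)
norms = np.linalg.norm(sentences, axis=1, keepdims=True)
unit = sentences / norms

# Step 2: dot products of unit vectors are cosine similarities; shape (11, 11)
similarity = unit @ unit.T
```

Entry similarity[i, j] is the cosine similarity between sentence i and sentence j, and the diagonal is 1 up to floating-point error.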

Upvotes: 0
