Syed Arefinul Haque
Syed Arefinul Haque

Reputation: 1325

Cosine similarity between pair of arrays in Bigquery

I have created a table that has a pair of IDs and coordinate fro each of them so that I can calculate pairwise cosine similarity between them.

The table looks like thisenter image description here

The number of dimension for the coords are currently 128, but it can vary. But the number dimensions for a pair of ID are always same in the same table.

coord1 and coord2 are repeated field (array) with floating point values.

Is there a way to calculate cosine similarity between them?

My expected output would have three columns, id1, id2 and cosine_similarity.

Upvotes: 3

Views: 1938

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

Below is for BigQuery Standard SQL

#standardSQL
SELECT id1, id2, ( 
  SELECT 
    SUM(value1 * value2)/ 
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(coord1) value1 WITH OFFSET pos1 
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM `project.dataset.table`  

below is dummy example for you to play with

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id1, [1.0, 2.0, 3.0, 4.0] coord1, 2 id2, [1.0, 2.0, 3.0, 4.0] coord2 UNION ALL
  SELECT 3, [2.0, 0.0, 1.0, 1.0, 0, 2.0, 1.0, 1.0], 4, [2.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
)
SELECT id1, id2, ( 
  SELECT 
    SUM(value1 * value2)/
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(coord1) value1 WITH OFFSET pos1 
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM `project.dataset.table`  

with result

Row id1 id2 cosine_similarity    
1   1   2   1.0  
2   3   4   0.8215838362577491   

Upvotes: 8

Related Questions