conradlee
conradlee

Reputation: 13705

SQL: Inconsistent column values after self-join

I'm using BigQuery to compute pairwise distances of vectors. Each row has a vector of weights.

Here's the query. The first parts just generate random data. The important part is the final SELECT statement in which I compare each document to every other document via a self-join.

WITH input_data AS (
  SELECT
    document_id
  FROM UNNEST(["doc1", "doc2"]) AS document_id
),
exploded AS (
  SELECT
    document_id,
    STRUCT(dimension_id, rand() AS weight) AS weight
  FROM
    input_data CROSS JOIN UNNEST(GENERATE_ARRAY(1, 3)) AS dimension_id
),
rows_with_weights AS (
  SELECT 
    document_id,
    ARRAY_AGG(weight) as vector
  FROM 
    exploded 
  GROUP BY 
    document_id
),
SELECT 
  l.document_id as l_doc_id,
  r.document_id as r_doc_id,
  l.vector as l_vectors,
  r.vector as r_vectors
FROM 
  rows_with_weights AS l 
CROSS JOIN rows_with_weights AS r WHERE l.document_id <= r.document_id

You'll see that I also compare each document to itself, and here's where something unexpected happens: when a document is compared to itself, the values of l_vector are distinct from those of r_vector. Here's the output of the above query:

enter image description here

This is puzzling. I would expect these two vectors to be the same for the same document. What's going on here?

Upvotes: 1

Views: 105

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

What's going on here?

This is as expected!

BigQuery does not materialize the results of CTEs. If CTE is referenced in multiple places in a query, then the CTE is executed once for each reference. Which means that when you do self join you actually join two tables with different weights. Thus the result

I recommend to use temp tables instead so randomness will be materialized once and then you will get expected result

Upvotes: 2

Related Questions