Reputation: 13705
I'm using BigQuery to compute pairwise distances of vectors. Each row has a vector of weights.
Here's the query. The first parts just generate random data. The important part is the final SELECT
statement in which I compare each document to every other document via a self-join.
WITH input_data AS (
SELECT
document_id
FROM UNNEST(["doc1", "doc2"]) AS document_id
),
exploded AS (
SELECT
document_id,
STRUCT(dimension_id, rand() AS weight) AS weight
FROM
input_data CROSS JOIN UNNEST(GENERATE_ARRAY(1, 3)) AS dimension_id
),
rows_with_weights AS (
SELECT
document_id,
ARRAY_AGG(weight) as vector
FROM
exploded
GROUP BY
document_id
),
SELECT
l.document_id as l_doc_id,
r.document_id as r_doc_id,
l.vector as l_vectors,
r.vector as r_vectors
FROM
rows_with_weights AS l
CROSS JOIN rows_with_weights AS r WHERE l.document_id <= r.document_id
You'll see that I also compare each document to itself, and here's where something unexpected happens: when a document is compared to itself, the values of l_vector
are distinct from those of r_vector
. Here's the output of the above query:
This is puzzling. I would expect these two vectors to be the same for the same document. What's going on here?
Upvotes: 1
Views: 105
Reputation: 173046
What's going on here?
This is as expected!
BigQuery does not materialize the results of CTEs. If CTE is referenced in multiple places in a query, then the CTE is executed once for each reference. Which means that when you do self join you actually join two tables with different weights. Thus the result
I recommend to use temp tables instead so randomness will be materialized once and then you will get expected result
Upvotes: 2