Reputation: 222
I am trying to better understand BigQuery queryPlan, and try to establish data lineage at a column-level using that queryPlan.
Everything started fine, READ
, COMPUTE
, etc... are pretty understandable, then I crossed a JOIN
step. Using a simple query that will join two tables on a single column, and output columns from both tables:
CREATE OR REPLACE TABLE wrk_dataset.t1 AS ( select 1 as t1f1 );
CREATE OR REPLACE TABLE wrk_dataset.t2 AS ( select 1 as t2f1 );
SELECT t1f1 AS t3f1,
t1f1 AS t3f2,
t2f1 AS t3f3,
FROM wrk_dataset.t1
JOIN wrk_dataset.t2
ON t1f1 = t2f1
I got those steps in my queryPlan:
READ $10:t2f1
FROM wrk_dataset.t2
WRITE $10
TO __stage00_output
READ $1:t1f1
FROM wrk_dataset.t1
READ $10
FROM __stage00_output
JOIN INNER HASH JOIN EACH WITH ALL ON $1 = $10
WRITE $20, $21
TO __stage02_output
So at first, everything is fine,
$10
links to wrk_dataset.t2.t2f1
$1
links to wrk_dataset.t1.t1f1
$1
and $10
(so wrk_dataset.t1.t1f1
and wrk_dataset.t2.t2f1
)And then it becomes complicated: that step outputs $20
and $21
, but what are they the product of? Also in my SELECT
, I asked for 3 columns (with 2 of them being identical), so I can guess that $20
is probably t1f1
and $21
is t2f1
but only by looking at the SELECT
.
Is it not possible to know the result of a query just by looking at its plan like other framework?
So, apparently just a simple READ
is harder to understand that I thought.
SELECT id, id FROM ( select 1 as id )
Give me that queryPlan:
COMPUTE $1 := 1
WRITE $1
TO __stage00_output
So I guess I cannot determinate the schema of the table that will be the result of my query by looking only at the queryPlan?
Upvotes: 0
Views: 142
Reputation: 1955
Its not possible by the query plan. The $variables
sent to the last step TO __stage00_output
contains the column data to be sent to the output. If it is repeated, it will not be sent multiple times. It will just be showed multiple times in the final output (as there no meaning to pass same information multiple times from one step to another.)
if there is no repeated columns, it can be considered a representation of output schema, but better not use it. The query plan is meant to be used to troubleshooting and identify bottlenecks on the queries by monitoring its execution time ans so on..
Upvotes: 1