Bigquery query plan produce unknown output

Question

I am trying to better understand BigQuery queryPlan, and try to establish data lineage at a column-level using that queryPlan.

Everything started fine, READ, COMPUTE, etc... are pretty understandable, then I crossed a JOIN step. Using a simple query that will join two tables on a single column, and output columns from both tables:

CREATE OR REPLACE TABLE wrk_dataset.t1 AS ( select 1 as t1f1 );
CREATE OR REPLACE TABLE wrk_dataset.t2 AS ( select 1 as t2f1 );

SELECT t1f1 AS t3f1,
       t1f1 AS t3f2, 
       t2f1 AS t3f3,
FROM   wrk_dataset.t1
JOIN   wrk_dataset.t2
  ON   t1f1 = t2f1

I got those steps in my queryPlan:

S00: Input

READ    $10:t2f1
        FROM wrk_dataset.t2

WRITE   $10
        TO __stage00_output

S02: Output

READ    $1:t1f1
        FROM wrk_dataset.t1

READ    $10
        FROM __stage00_output

JOIN    INNER HASH JOIN EACH WITH ALL ON $1 = $10

WRITE   $20, $21
        TO __stage02_output

So at first, everything is fine,

$10 links to wrk_dataset.t2.t2f1
$1 links to wrk_dataset.t1.t1f1
the join is a inner hash comparing $1 and $10 (so wrk_dataset.t1.t1f1 and wrk_dataset.t2.t2f1)

And then it becomes complicated: that step outputs $20 and $21, but what are they the product of? Also in my SELECT, I asked for 3 columns (with 2 of them being identical), so I can guess that $20 is probably t1f1 and $21 is t2f1 but only by looking at the SELECT.

Is it not possible to know the result of a query just by looking at its plan like other framework?

EDIT

So, apparently just a simple READ is harder to understand that I thought.

SELECT id, id FROM ( select 1 as id )

Give me that queryPlan:

COMPUTE     $1 := 1
WRITE       $1
            TO __stage00_output

So I guess I cannot determinate the schema of the table that will be the result of my query by looking only at the queryPlan?

ewertonvsilva · Accepted Answer

Its not possible by the query plan. The $variables sent to the last step TO __stage00_output contains the column data to be sent to the output. If it is repeated, it will not be sent multiple times. It will just be showed multiple times in the final output (as there no meaning to pass same information multiple times from one step to another.)

if there is no repeated columns, it can be considered a representation of output schema, but better not use it. The query plan is meant to be used to troubleshooting and identify bottlenecks on the queries by monitoring its execution time ans so on..

Bigquery query plan produce unknown output

S00: Input

S02: Output

EDIT

Answers (1)

Related Questions