Mengard
Mengard

Reputation: 222

Bigquery query plan produce unknown output

I am trying to better understand BigQuery queryPlan, and try to establish data lineage at a column-level using that queryPlan.

Everything started fine, READ, COMPUTE, etc... are pretty understandable, then I crossed a JOIN step. Using a simple query that will join two tables on a single column, and output columns from both tables:

CREATE OR REPLACE TABLE wrk_dataset.t1 AS ( select 1 as t1f1 );
CREATE OR REPLACE TABLE wrk_dataset.t2 AS ( select 1 as t2f1 );

SELECT t1f1 AS t3f1,
       t1f1 AS t3f2, 
       t2f1 AS t3f3,
FROM   wrk_dataset.t1
JOIN   wrk_dataset.t2
  ON   t1f1 = t2f1

I got those steps in my queryPlan:

S00: Input

READ    $10:t2f1
        FROM wrk_dataset.t2

WRITE   $10
        TO __stage00_output

S02: Output

READ    $1:t1f1
        FROM wrk_dataset.t1

READ    $10
        FROM __stage00_output

JOIN    INNER HASH JOIN EACH WITH ALL ON $1 = $10

WRITE   $20, $21
        TO __stage02_output 

So at first, everything is fine,

And then it becomes complicated: that step outputs $20 and $21, but what are they the product of? Also in my SELECT, I asked for 3 columns (with 2 of them being identical), so I can guess that $20 is probably t1f1 and $21 is t2f1 but only by looking at the SELECT.

Is it not possible to know the result of a query just by looking at its plan like other framework?

EDIT

So, apparently just a simple READ is harder to understand that I thought.

SELECT id, id FROM ( select 1 as id )

Give me that queryPlan:

COMPUTE     $1 := 1
WRITE       $1
            TO __stage00_output 

So I guess I cannot determinate the schema of the table that will be the result of my query by looking only at the queryPlan?

Upvotes: 0

Views: 142

Answers (1)

ewertonvsilva
ewertonvsilva

Reputation: 1955

Its not possible by the query plan. The $variables sent to the last step TO __stage00_output contains the column data to be sent to the output. If it is repeated, it will not be sent multiple times. It will just be showed multiple times in the final output (as there no meaning to pass same information multiple times from one step to another.)

if there is no repeated columns, it can be considered a representation of output schema, but better not use it. The query plan is meant to be used to troubleshooting and identify bottlenecks on the queries by monitoring its execution time ans so on..

Upvotes: 1

Related Questions