Reputation: 329
I'm writing a dbt model to join two tables on multiple columns, compiled to something like this:
SELECT
A.col1,
A.col2,
A.col3,
FROM
A
LEFT JOIN
B
ON
(A.col1 = B.col1 OR (IS_NAN(A.col1) AND IS_NAN(B.col1))
AND (A.col2 = B.col2 OR (IS_NAN(A.col2) AND IS_NAN(B.col2))
AND (A.col3 = B.col3 OR (IS_NAN(A.col3) AND IS_NAN(B.col3))
and this logic will be applied to many table pairs, so I need a macro. The joining logic is the same on all columns, so a loop over columns in the ON clause would be perfect, like this
SELECT
{% for col in all_cols %}
A.{{ col }},
{% endfor %}
FROM
A
LEFT JOIN
B
ON
{% for col in all_cols %}
(A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})),
<-- What to put here for AND the next condition???
{% endfor %}
How can I concatenate the conditions in ON clause with AND
when iterating over columns?
Upvotes: 2
Views: 1989
Reputation: 3528
Your sample query is missing several )
in the on
statement.
Since you asked for BigQuery, I show here a route to do the task directly with BigQuery without using the dbt
tool.
First generate a dataset Test
and two tables A
and B
CREATE OR REPLACE TABLE
Test.B AS
SELECT
IF
(RAND()>0.5,NULL,RAND()) col1,
IF
(RAND()>0.5,NULL,RAND()) col2,
IF
(RAND()>0.5,NULL,RAND()) col3
FROM
UNNEST(GENERATE_ARRAY(1,100)) a
Then run this query in the same region, has to be set manually if not US.
DECLARE col_list ARRAY<STRING> ;
DECLARE col_list_A STRING ;
DECLARE col_list_B STRING ;
DECLARE col_list_on STRING ;
EXECUTE IMMEDIATE
"Select array_agg(column_name) from Test.INFORMATION_SCHEMA.COLUMNS where TABLE_NAME='A'" INTO col_list;
EXECUTE IMMEDIATE
"Select STRING_AGG(concat('A.',cols)) FROM UNNEST(?) cols" INTO col_list_A USING col_list;
EXECUTE IMMEDIATE
"Select STRING_AGG(concat('B.',cols)) FROM UNNEST(?) cols" INTO col_list_B USING col_list;
EXECUTE IMMEDIATE
"Select STRING_AGG(concat('(A.',cols,' = B.',cols,' OR (IS_NAN(A.',cols,') AND IS_NAN(B.',cols,')) ) '),' AND ') FROM UNNEST(?) cols" INTO col_list_on USING col_list;
EXECUTE IMMEDIATE
"SELECT " || col_list_A || "," || col_list_B || " FROM Test.A LEFT JOIN Test.B ON " || col_list_on
First DECLARE
all variables. Then query the column name of table A
to variable col_list
. Use concat
to build the A.col1, A.col2 ...
list and then for B.
as well. The concat
is again used for the ON
conditions.
Finally all variables are put into query.
I would like to warn that this final query will perform poor on larger tables. In cases this is an issue for you, please fell free to ask another question, given more details about your goal.
Upvotes: -1
Reputation: 5805
The cute way (add a predicate that is always true, so you can start every statement with AND
):
SELECT
{% for col in all_cols %}
A.{{ col }},
{% endfor %}
FROM
A
LEFT JOIN
B
ON
1=1
{% for col in all_cols %}
AND (A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})))
{% endfor %}
The less-cute way, using loop.first
(loop
is a variable set by jinja inside a for loop that has some handy properties. loop.first
and loop.last
are especially useful):
SELECT
{% for col in all_cols %}
A.{{ col }},
{% endfor %}
FROM
A
LEFT JOIN
B
ON
{% for col in all_cols %}
{% if not loop.first %}AND{% endif %} (A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})))
{% endfor %}
Upvotes: 2