ningl
ningl

Reputation: 329

dbt macro: How to join tables on multiple columns in a loop

I'm writing a dbt model to join two tables on multiple columns, compiled to something like this:

SELECT 
  A.col1,
  A.col2,
  A.col3,
FROM
   A
LEFT JOIN
   B
ON
   (A.col1 = B.col1 OR (IS_NAN(A.col1) AND IS_NAN(B.col1))
   AND (A.col2 = B.col2 OR (IS_NAN(A.col2) AND IS_NAN(B.col2))
   AND (A.col3 = B.col3 OR (IS_NAN(A.col3) AND IS_NAN(B.col3))

and this logic will be applied to many table pairs, so I need a macro. The joining logic is the same on all columns, so a loop over columns in the ON clause would be perfect, like this

SELECT 
  {% for col in all_cols %}
    A.{{ col }},
  {% endfor %}
FROM
   A
LEFT JOIN
   B
ON
    {% for col in all_cols %}
        (A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})),
       <-- What to put here for AND the next condition???
    {% endfor %}

How can I concatenate the conditions in ON clause with AND when iterating over columns?

Upvotes: 2

Views: 1989

Answers (2)

Samuel
Samuel

Reputation: 3528

Your sample query is missing several ) in the on statement.

Since you asked for BigQuery, I show here a route to do the task directly with BigQuery without using the dbt tool.

First generate a dataset Test and two tables A and B

CREATE OR REPLACE TABLE
  Test.B AS
SELECT
IF
  (RAND()>0.5,NULL,RAND()) col1,
IF
  (RAND()>0.5,NULL,RAND()) col2,
IF
  (RAND()>0.5,NULL,RAND()) col3
FROM
  UNNEST(GENERATE_ARRAY(1,100)) a

Then run this query in the same region, has to be set manually if not US.

DECLARE col_list ARRAY<STRING> ;
DECLARE col_list_A STRING ;
DECLARE col_list_B STRING ;
DECLARE col_list_on STRING ;
EXECUTE IMMEDIATE
  "Select array_agg(column_name) from Test.INFORMATION_SCHEMA.COLUMNS where TABLE_NAME='A'" INTO col_list;

EXECUTE IMMEDIATE
"Select STRING_AGG(concat('A.',cols)) FROM UNNEST(?) cols" INTO col_list_A USING col_list;

EXECUTE IMMEDIATE
"Select STRING_AGG(concat('B.',cols)) FROM UNNEST(?) cols" INTO col_list_B USING col_list;

EXECUTE IMMEDIATE
"Select STRING_AGG(concat('(A.',cols,' = B.',cols,' OR (IS_NAN(A.',cols,') AND IS_NAN(B.',cols,')) ) '),' AND ') FROM UNNEST(?) cols" INTO col_list_on USING col_list;

EXECUTE IMMEDIATE
"SELECT " || col_list_A || "," || col_list_B || " FROM Test.A LEFT JOIN Test.B ON " || col_list_on

First DECLARE all variables. Then query the column name of table A to variable col_list. Use concat to build the A.col1, A.col2 ... list and then for B. as well. The concat is again used for the ON conditions.

Finally all variables are put into query.

I would like to warn that this final query will perform poor on larger tables. In cases this is an issue for you, please fell free to ask another question, given more details about your goal.

Upvotes: -1

tconbeer
tconbeer

Reputation: 5805

The cute way (add a predicate that is always true, so you can start every statement with AND):

SELECT 
  {% for col in all_cols %}
    A.{{ col }},
  {% endfor %}
FROM
   A
LEFT JOIN
   B
ON
    1=1
    {% for col in all_cols %}
        AND (A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})))
    {% endfor %}

The less-cute way, using loop.first (loop is a variable set by jinja inside a for loop that has some handy properties. loop.first and loop.last are especially useful):

SELECT 
  {% for col in all_cols %}
    A.{{ col }},
  {% endfor %}
FROM
   A
LEFT JOIN
   B
ON
    {% for col in all_cols %}
        {% if not loop.first %}AND{% endif %} (A.{{col}} = B.{{col}} OR (IS_NAN(A.{{col}}) AND IS_NAN(B.{{col}})))
    {% endfor %}

Upvotes: 2

Related Questions