Efficiently outer join two array columns row-wise in BigQuery table

Question

I'll first state the question as simply as possible, then elaborate with more detail and an example.

Concise question without context

I have a table with rows containing columns of arrays. I need to outer join the elements of some pairs of these, compute some variables, and then aggregate the results back into a new array. I'm currently using a pattern where I:

unnest each column in the pair to be joined (cross join to PK of row)
full outer join the two on the PK and compute desired fields
group by PK to get back to single row with array column that summarizes the results

Is there a way to do this without the multiple unnesting and grouping back down?

More context and an example

I have a table which represents edits to an entity that is made up of multiple sub-records. Each row represents a single entity. There is a column before that contains the records before the edit, and another after that contains the records afterwards.

My goal is to label each sub-record with exactly one of the four valid edit types:

DELETE - record exists in before but not after
ADD - record exists in after but not before
EDIT - record exists in both before and after but any field was changed
NONE - record exists in both before and after and no fields were changed

Each of the sub-record values is represented by its ID and a hash of all of its fields. I've created some fake data and provided my initial implementation below. This works, but it seems very roundabout.

WITH source_data AS (
  SELECT
    1 AS pkField,
    [
      STRUCT(1 AS id, 1 AS fieldHash),
      STRUCT(2 AS id, 2 AS fieldHash),
      STRUCT(3 AS id, 3 AS fieldHash)
    ] AS before,
    [
      STRUCT(1 AS id, 1 AS fieldHash),
      STRUCT(2 AS id, 0 AS fieldHash),  -- record 2 edited
      -- record 3 deleted
      STRUCT(4 AS id, 4 AS fieldHash), -- record 4 added
      STRUCT(5 AS id, 5 AS fieldHash)  -- record 5 added
    ] AS after
)
SELECT
  pkField,
  ARRAY_AGG(STRUCT(
    id,
    CASE
      WHEN beforeHash IS NULL THEN "ADD"
      WHEN afterHash IS NULL THEN "DELETE"
      WHEN beforeHash <> afterHash THEN "EDIT"
      ELSE "NONE"
    END AS editType
  )) AS edits
FROM (
  SELECT pkField, id, fieldHash AS beforeHash
  FROM source_data
  CROSS JOIN UNNEST(source_data.before)
)
FULL OUTER JOIN (
  SELECT pkField, id, fieldHash AS afterHash
  FROM source_data
  CROSS JOIN UNNEST(source_data.after)
)
USING (pkField, id)
GROUP BY pkField

Is there a simpler and/or more efficient way to do this? Perhaps something that avoids the multiple unnesting and grouping back down?

Mikhail Berlyant · Accepted Answer

I think, what you have is already simple and efficient way!
Meantime, you can consider below optimized version

select pkField, 
  array(select struct(
      id, case
      when b.fieldHash is null then 'ADD'
      when a.fieldHash is null then 'DELETE'
      when b.fieldHash != a.fieldHash then 'EDIT'
      else 'NONE'
      end as editType
    ) edits
    from (select id, fieldHash from t.before) b
    full outer join (select id, fieldHash from t.after) a
    using(id)
  ) edits
from source_data t

if applied to sample data in your question - output is

Efficiently outer join two array columns row-wise in BigQuery table

Answers (1)

Related Questions